Microsoft Malware Detection

Link to my Detailed YouTube Video Explaining the whole Notebook

IMAGE ALT TEXT

The actual Kaggle Challenge

Identify whether a given piece of file/software is a malware.

In this Notebook, I achieved a test log loss of 0.0070458 with XGBoost

(This Test Dataset is based on splitting only train.7z (which is ~200GB after extraction) into Train, Test and CV )

What is in this kernel

  1. Data Description
  2. Issues I encountered for this large dataset
  3. My Final overall approach to handle this huge dataset
  4. Data Overview
  5. Performance Metric
  6. Machine Learing Objectives and Constraints
  7. Exploratory Data Analysis
  8. Distribution of malware classes in whole data set
  9. File size of byte files as a feature
  10. box plots of file size (.byte files) feature
  11. Uni-Gram Byte Feature extraction from byte files
  12. Multivariate Analysis on byte files
  13. Train Test split of only Byte Files Features
  14. Random Model ONLY on bytes files
  15. K Nearest Neighbour Classification ONLY on bytes files
  16. Logistic Regression ONLY on bytes files
  17. Random Forest Classifier ONLY on bytes files
  18. XgBoost Classification ONLY on bytes files
  19. XgBoost Classification with best hyper parameters using RandomSearch ONLY on bytes files
  20. Modeling with .asm files
  21. Feature extraction from asm files
  22. Files sizes of each .asm file as a feature
  23. Univariate analysis ONLY on .asm file features
  24. Multivariate Analysis ONLY on .asm file features
  25. Conclusion on EDA ( ONLY on .asm file features)
  26. Train and test split ( ONLY on .asm file featues )
  27. K-Nearest Neigbors ONLY on .asm file features
  28. Logistic Regression ONLY on .asm file features
  29. Random Forest Classifier ONLY on .asm file features
  30. XgBoost Classifier ONLY on .asm file features
  31. Xgboost Classifier with best hyperparameters ( ONLY on .asm file features )
  32. FINAL FEATURIZATION STEPS FOR THE FINAL XGBOOST MODEL TRAINING
  33. Uni-Gram Byte Feature extraction from byte files - For FINAL Model Train
  34. File sizes of Byte files - Feature Extraction -For FINAL Model Train
  35. Creating some important Files and Folders, which I shall use later for saving Featuarized versions of .csv files
  36. Merging Unigram of Byte Files + Size of Byte Files to create uni_gram_byte_features__with_size
  37. Bi-Gram Byte Feature extraction from byte files
  38. Extracting the 2000 Most Important Features from Byte bigrams using SelectKBest with Chi-Square Test
  39. ASM Unigram - Top 52 Unigram Features from ASM Files - Final Model Training
  40. File Size of ASM Files - Feature Extraction - Final Model Training
  41. Merging ASM Unigram + ASM File Size
  42. ASM Files - Convert the ASM files to images
  43. Extract the first 800 pixel data from ASM File Images
  44. Extracting Opcodes Bigrams from ASM Files
  45. Calcualte opcodes bigram with above defined function and make them a feature and then save the data matrix of feature as a .csv file
  46. ASM File - Top Important 500 features from Opcodes Bigrams
  47. Opcodes Trigrams ASM Files - Feature extraction
  48. SM File - Top Important 800 features from Opcodes Trigrams
  49. Final Merging of all Features for the Final XGBOOST Training
  50. Final Train Test Split. 64% Train, 16% Cross Validation, 20% Test
  51. Final XGBoost Training - Hyperparameter tuning with on Final Merged Data-Matrix
  52. Final running of XGBoost with the Best HyperParams that we got from above RandomizedSearchCV
  53. Possibiliy of Further Analysis and Featurizition

1.Data Description

Back to the top

You are provided with a set of known malware files representing a mix of 9 different families. Each malware file has an Id, a 20 character hash value uniquely identifying the file, and a Class, an integer representing one of 9 family names to which the malware may belong:

Imgur

For each file, the raw data contains the hexadecimal representation of the file's binary content, without the PE header (to ensure sterility). You are also provided a metadata manifest, which is a log containing various metadata information extracted from the binary, such as function calls, strings, etc. This was generated using the IDA disassembler tool. Your task is to develop the best mechanism for classifying files in the test set into their respective family affiliations.

The dataset contains the following files:

  • train.7z - the raw data for the training set (MD5 hash = 4fedb0899fc2210a6c843889a70952ed)
  • trainLabels.csv - the class labels associated with the training set
  • test.7z - the raw data for the test set (MD5 hash = 84b6fbfb9df3c461ed2cbbfa371ffb43)
  • sampleSubmission.csv - a file showing the valid submission format
  • dataSample.csv - a sample of the dataset to preview before downloading

Here we are provided with raw data and no pre-extracted features were available

Total train dataset consist of 200GB data out of which 50Gb of data is .bytes files and 150GB of data is .asm files:


2. Issues I encountered for this large dataset

Back to the top

Due to the large size of the dataset (500GB), I had real issues fitting the data into memory during runtime (Colab Pro Failed, and I definitely could not run it in Kaggle)

In Kaggle I got out of disk-space when trying to extract only the train.7z file.

The below code started extracting and only at around 5% I got the out of disk error

!pip install py7zr

!python -m py7zr x full_path_of_7z_file

!python -m py7zr x /content/gdrive/MyDrive/MS_Malware_Kaggle_to_Gdrive/train.7z

It could definitely have been done in Google-Cloud or AWS but have not tried these option.


3. My Final overall approach to handle this huge dataset

Back to the top

Therefore, I ONLY extracted train.7z (which is ~200GB after extraction) in my local machine and then split this set into Train, Test and CV set. And from this split dataset, I did my entire analysis on the train and did the validation part on test and cv set.

Further to be able to accomodate it in my local Machine (which is not too high-end )

First did all my calculations and experimentations and featuriazation ONLY on a sample of 50 files (i.e. 50 each from byteFiles and asmFile ) -

After ONLY I saw that all the featuriazation calculations and xgBoost is running on these 50 samples, then only I ran the same notebook on the full Dataset of 200GB with 20,000+ files

And here's my approach for calculating all the featuriazations (both for 50-samples and full-dataset).

  1. Did all the file processing (which are CPU based) part in local machine, it took around 25 to 26 hours.

i.e. This includes calculating the below features in local machine.

  • Unigram of Byte Files + Size of Byte Files + Top 52 Unigram of ASM Files (These are alrady given by AML)

Added following extra features.

  • Size of ASM Files
  • Top 2000 Bi-Gram of Byte files +
  • Top 500 Bigram of Opcodes of ASM Files
  • Top 800 Trigram of Opcodes of ASM Files
  • Top 800 ASM Image Features

After merging all the above features, the merged dataframe that I got, I created to a .csv file from that (i.e. with the regular to_csv() function ).

This .csv file with the final merged dataset was just about 170MB. Then Uploaded this file to google-drive.

And then from Colab just imported that same final merged .csv file and saved that in a pandas dataframe > do train test and cv split on this and >

Now below 2 steps with Colab Pro's Tesla V100 16GB GPU

RandomizedSearchCV for hyper param tuning > and after ran XGBoost with best params.

And these final RandomizedSearchCV and XGBoost took only like 30 mints.


4. Data Overview

Back to the top

</h1>2.1.2. Example Data Point</h3>

.asm file

.text:00401000                                     assume es:nothing, ss:nothing, ds:_data, fs:nothing, gs:nothing
.text:00401000 56                                  push    esi
.text:00401001 8D 44 24 08                             lea     eax, [esp+8]
.text:00401005 50                                  push    eax
.text:00401006 8B F1                                   mov     esi, ecx
.text:00401008 E8 1C 1B 00 00                              call    ??0exception@std@@QAE@ABQBD@Z ; std::exception::exception(char const * const &)
.text:0040100D C7 06 08 BB 42 00                           mov     dword ptr [esi], offset off_42BB08
.text:00401013 8B C6                                   mov     eax, esi
.text:00401015 5E                                  pop     esi
.text:00401016 C2 04 00                                retn    4
.text:00401016                             ; ---------------------------------------------------------------------------
.text:00401019 CC CC CC CC CC CC CC                        align 10h
.text:00401020 C7 01 08 BB 42 00                           mov     dword ptr [ecx], offset off_42BB08
.text:00401026 E9 26 1C 00 00                              jmp     sub_402C51
.text:00401026                             ; ---------------------------------------------------------------------------
.text:0040102B CC CC CC CC CC                              align 10h
.text:00401030 56                                  push    esi
.text:00401031 8B F1                                   mov     esi, ecx
.text:00401033 C7 06 08 BB 42 00                           mov     dword ptr [esi], offset off_42BB08
.text:00401039 E8 13 1C 00 00                              call    sub_402C51
.text:0040103E F6 44 24 08 01                              test    byte ptr [esp+8], 1
.text:00401043 74 09                                   jz      short loc_40104E
.text:00401045 56                                  push    esi
.text:00401046 E8 6C 1E 00 00                              call    ??3@YAXPAX@Z    ; operator delete(void *)
.text:0040104B 83 C4 04                                add     esp, 4
.text:0040104E
.text:0040104E                             loc_40104E:                 ; CODE XREF: .text:00401043j
.text:0040104E 8B C6                                   mov     eax, esi
.text:00401050 5E                                  pop     esi
.text:00401051 C2 04 00                                retn    4
.text:00401051                             ; ---------------------------------------------------------------------------

.bytes file

00401000 00 00 80 40 40 28 00 1C 02 42 00 C4 00 20 04 20
00401010 00 00 20 09 2A 02 00 00 00 00 8E 10 41 0A 21 01
00401020 40 00 02 01 00 90 21 00 32 40 00 1C 01 40 C8 18
00401030 40 82 02 63 20 00 00 09 10 01 02 21 00 82 00 04
00401040 82 20 08 83 00 08 00 00 00 00 02 00 60 80 10 80
00401050 18 00 00 20 A9 00 00 00 00 04 04 78 01 02 70 90
00401060 00 02 00 08 20 12 00 00 00 40 10 00 80 00 40 19
00401070 00 00 00 00 11 20 80 04 80 10 00 20 00 00 25 00
00401080 00 00 01 00 00 04 00 10 02 C1 80 80 00 20 20 00
00401090 08 A0 01 01 44 28 00 00 08 10 20 00 02 08 00 00
004010A0 00 40 00 00 00 34 40 40 00 04 00 08 80 08 00 08
004010B0 10 00 40 00 68 02 40 04 E1 00 28 14 00 08 20 0A
004010C0 06 01 02 00 40 00 00 00 00 00 00 20 00 02 00 04
004010D0 80 18 90 00 00 10 A0 00 45 09 00 10 04 40 44 82
004010E0 90 00 26 10 00 00 04 00 82 00 00 00 20 40 00 00
004010F0 B4 00 00 40 00 02 20 25 08 00 00 00 00 00 00 00
00401100 08 00 00 50 00 08 40 50 00 02 06 22 08 85 30 00
00401110 00 80 00 80 60 00 09 00 04 20 00 00 00 00 00 00
00401120 00 82 40 02 00 11 46 01 4A 01 8C 01 E6 00 86 10
00401130 4C 01 22 00 64 00 AE 01 EA 01 2A 11 E8 10 26 11
00401140 4E 11 8E 11 C2 00 6C 00 0C 11 60 01 CA 00 62 10
00401150 6C 01 A0 11 CE 10 2C 11 4E 10 8C 00 CE 01 AE 01
00401160 6C 10 6C 11 A2 01 AE 00 46 11 EE 10 22 00 A8 00
00401170 EC 01 08 11 A2 01 AE 10 6C 00 6E 00 AC 11 8C 00
00401180 EC 01 2A 10 2A 01 AE 00 40 00 C8 10 48 01 4E 11
00401190 0E 00 EC 11 24 10 4A 10 04 01 C8 11 E6 01 C2 00

5. Performance Metric

Back to the top

Source: https://www.kaggle.com/c/malware-classification#evaluation

Metric(s):

  • Multi class log-loss
  • Confusion matrix

6. Machine Learing Objectives and Constraints

Back to the top

Objective: Predict the probability of each data-point belonging to each of the nine classes.

Constraints:

  • Class probabilities are needed.
  • Penalize the errors in class probabilites => Metric is Log-loss.
  • Some Latency constraints.

7. Exploratory Data Analysis

Back to the top

In [ ]:
%%time

%pip install -U tornado
%pip install "dask[complete]"

import warnings
warnings.filterwarnings("ignore")
import shutil
import os
import pandas as pd
import matplotlib
matplotlib.use(u'nbAgg')
import matplotlib.pyplot as plt
import seaborn as sns
import numpy as np
from tqdm import tqdm
import pickle
from sklearn.manifold import TSNE
from sklearn import preprocessing
import pandas as pd
from multiprocessing import Process# this is used for multithreading
import multiprocessing
import codecs# this is used for file operations 
import random as r
from xgboost import XGBClassifier
from sklearn.model_selection import RandomizedSearchCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.calibration import CalibratedClassifierCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import log_loss
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
import re
from nltk.util import ngrams
from sklearn.feature_selection import SelectKBest, chi2, f_regression

import scipy.sparse
import gc
import pickle as pkl
from datetime import datetime as dt
import dask.dataframe as dd
In [ ]:
# separating byte files and asm files 
# Below is from AML Assignment file
from google.colab import drive
drive.mount('/content/gdrive')

root_path = '/content/gdrive/MyDrive/AML_Malware/Full_data/'
# root_path = '../../LARGE_Datasets/'
In [ ]:
#separating byte files and asm files 

source = 'train'
destination_1 = 'byteFiles'
destination_2 = 'asmFiles'

# we will check if the folder 'byteFiles' exists if it not there we will create a folder with the same name
if not os.path.isdir(destination_1):
    os.makedirs(destination_1)
if not os.path.isdir(destination_2):
    os.makedirs(destination_2)

# if we have folder called 'train' (train folder contains both .asm files and .bytes files) we will rename it 'asmFiles'
# for every file that we have in our 'asmFiles' directory we check if it is ending with .bytes, if yes we will move it to
# 'byteFiles' folder

# so by the end of this snippet we will separate all the .byte files and .asm files
if os.path.isdir(source):
    data_files = os.listdir(source)
    for file in data_files:
        print(file)
        if (file.endswith("bytes")):
            shutil.move(source+'\\'+file,destination_1)
        if (file.endswith("asm")):
            shutil.move(source+'\\'+file,destination_2)

8. Distribution of malware classes in whole data set

Back to the top

In [ ]:
Y=pd.read_csv("trainLabels.csv")
total = len(Y)*1.
ax=sns.countplot(x="Class", data=Y)
for p in ax.patches:
        ax.annotate('{:.1f}%'.format(100*p.get_height()/total), (p.get_x()+0.1, p.get_height()+5))

#put 11 ticks (therefore 10 steps), from 0 to the total number of rows in the dataframe
ax.yaxis.set_ticks(np.linspace(0, total, 11))

#adjust the ticklabel to the desired format, without changing the position of the ticks. 
ax.set_yticklabels(map('{:.1f}%'.format, 100*ax.yaxis.get_majorticklocs()/total))
plt.show()

9. File size of byte files as a feature

Back to the top

In [ ]:
#file sizes of byte files

files=os.listdir('byteFiles')
filenames=Y['Id'].tolist()
class_y=Y['Class'].tolist()
class_bytes=[]
sizebytes=[]
fnames=[]
for file in files:
    # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
    # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0, 
    # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
    # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
    statinfo=os.stat('byteFiles/'+file)
    # split the file name at '.' and take the first part of it i.e the file name
    file=file.split('.')[0]
    if any(file == filename for filename in filenames):
        i=filenames.index(file)
        class_bytes.append(class_y[i])
        # converting into Mb's
        sizebytes.append(statinfo.st_size/(1024.0*1024.0))
        fnames.append(file)
data_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
print (data_size_byte.head())
                     ID      size  Class
0  01azqd4InC7m9JpocGv5  4.234863      9
1  01IsoiSMh5gxyDYTl4CB  5.538818      2
2  01jsnpXSAlgw6aPeDxrU  3.887939      9
3  01kcPWA9K2BOxQeS5Rju  0.574219      1
4  01SuzwMJEIXsK7A8dQbl  0.370850      8

10. box plots of file size (.byte files) feature

Back to the top

In [ ]:
#boxplot of byte files
ax = sns.boxplot(x="Class", y="size", data=data_size_byte)
plt.title("boxplot of .bytes file sizes")
plt.show()

11. Uni-Gram Byte Feature extraction from byte files

Back to the top

In [ ]:
#removal of addres from byte files
# contents of .byte files
# ----------------
#00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08 
#-------------------
#we remove the starting address 00401000

files = os.listdir('byteFiles')
filenames=[]
array=[]
for file in files:
    if(file.endswith("bytes")):
        file=file.split('.')[0]
        text_file = open('byteFiles/'+file+".txt", 'w+')
        with open('byteFiles/'+file+".bytes","r") as fp:
            lines=""
            for line in fp:
                a=line.rstrip().split(" ")[1:]
                b=' '.join(a)
                b=b+"\n"
                text_file.write(b)
            fp.close()
            os.remove('byteFiles/'+file+".bytes")
        text_file.close()

files = os.listdir('byteFiles')
filenames2=[]
feature_matrix = np.zeros((len(files),257),dtype=int)
k=0



# program to convert into bag of words of bytefiles
# this is custom-built bag of words this is unigram bag of words
# This is a Custom Implementation of CountVectorizer as CountVectorizer will NOT suport working on such huge file system of 50GB
# For this Uni-Gram feature creating and writing to a file named 'result.csv'

byte_feature_file=open('result.csv','w+')
byte_feature_file.write("ID,0,1,2,3,4,5,6,7,8,9,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??")

byte_feature_file.write("\n")

for file in files:
    filenames2.append(file)
    byte_feature_file.write(file+",")
    if(file.endswith("txt")):
        with open('byteFiles/'+file,"r") as byte_flie:
            for lines in byte_flie:
                line=lines.rstrip().split(" ")
                for hex_code in line:
                    if hex_code=='??':
                        feature_matrix[k][256]+=1
                    else:
                        feature_matrix[k][int(hex_code,16)]+=1
        byte_flie.close()
    for i, row in enumerate(feature_matrix[k]):
        if i!=len(feature_matrix[k])-1:
            byte_feature_file.write(str(row)+",")
        else:
            byte_feature_file.write(str(row))
    byte_feature_file.write("\n")
    
    k += 1

byte_feature_file.close()
In [ ]:
byte_features=pd.read_csv("result.csv")
byte_features['ID']  = byte_features['ID'].str.split('.').str[0]
byte_features.head(2)
Out[ ]:
ID 0 1 2 3 4 5 6 7 8 ... f7 f8 f9 fa fb fc fd fe ff ??
0 01azqd4InC7m9JpocGv5 601905 3905 2816 3832 3345 3242 3650 3201 2965 ... 2804 3687 3101 3211 3097 2758 3099 2759 5753 1824
1 01IsoiSMh5gxyDYTl4CB 39755 8337 7249 7186 8663 6844 8420 7589 9291 ... 451 6536 439 281 302 7639 518 17001 54902 8588

2 rows × 258 columns

In [ ]:
data_size_byte.head(2)
Out[ ]:
ID size Class
0 01azqd4InC7m9JpocGv5 4.234863 9
1 01IsoiSMh5gxyDYTl4CB 5.538818 2
In [ ]:
byte_features_with_size = byte_features.merge(data_size_byte, on='ID')
byte_features_with_size.to_csv("result_with_size.csv")
byte_features_with_size.head(2)
Out[ ]:
ID 0 1 2 3 4 5 6 7 8 ... f9 fa fb fc fd fe ff ?? size Class
0 01azqd4InC7m9JpocGv5 601905 3905 2816 3832 3345 3242 3650 3201 2965 ... 3101 3211 3097 2758 3099 2759 5753 1824 4.234863 9
1 01IsoiSMh5gxyDYTl4CB 39755 8337 7249 7186 8663 6844 8420 7589 9291 ... 439 281 302 7639 518 17001 54902 8588 5.538818 2

2 rows × 260 columns

In [ ]:
# https://stackoverflow.com/a/29651514
def normalize(df):
    result1 = df.copy()
    for feature_name in df.columns:
        if (str(feature_name) != str('ID') and str(feature_name)!=str('Class')):
            max_value = df[feature_name].max()
            min_value = df[feature_name].min()
            result1[feature_name] = (df[feature_name] - min_value) / (max_value - min_value)
    return result1

result = normalize(byte_features_with_size)
In [ ]:
result.head(2)
Out[ ]:
ID 0 1 2 3 4 5 6 7 8 ... f9 fa fb fc fd fe ff ?? size Class
0 01azqd4InC7m9JpocGv5 0.262806 0.005498 0.001567 0.002067 0.002048 0.001835 0.002058 0.002946 0.002638 ... 0.01356 0.013107 0.013634 0.031724 0.014549 0.014348 0.007843 0.000129 0.092219 9
1 01IsoiSMh5gxyDYTl4CB 0.017358 0.011737 0.004033 0.003876 0.005303 0.003873 0.004747 0.006984 0.008267 ... 0.00192 0.001147 0.001329 0.087867 0.002432 0.088411 0.074851 0.000606 0.121236 2

2 rows × 260 columns

In [ ]:
data_y = result['Class']
result.head()
Out[ ]:
ID 0 1 2 3 4 5 6 7 8 ... f9 fa fb fc fd fe ff ?? Class size
0 01azqd4InC7m9JpocGv5 0.262806 0.005498 0.001567 0.002067 0.002048 0.001835 0.002058 0.002946 0.002638 ... 0.013560 0.013107 0.013634 0.031724 0.014549 0.014348 0.007843 0.000129 9 0.092219
1 01IsoiSMh5gxyDYTl4CB 0.017358 0.011737 0.004033 0.003876 0.005303 0.003873 0.004747 0.006984 0.008267 ... 0.001920 0.001147 0.001329 0.087867 0.002432 0.088411 0.074851 0.000606 2 0.121236
2 01jsnpXSAlgw6aPeDxrU 0.040827 0.013434 0.001429 0.001315 0.005464 0.005280 0.005078 0.002155 0.008104 ... 0.009804 0.011777 0.012604 0.028423 0.013080 0.013937 0.067001 0.000033 9 0.084499
3 01kcPWA9K2BOxQeS5Rju 0.009209 0.001708 0.000404 0.000441 0.000770 0.000354 0.000310 0.000481 0.000959 ... 0.002121 0.001886 0.002272 0.013032 0.002211 0.003957 0.010904 0.000984 1 0.010759
4 01SuzwMJEIXsK7A8dQbl 0.008629 0.001000 0.000168 0.000234 0.000342 0.000232 0.000148 0.000229 0.000376 ... 0.001530 0.000853 0.001052 0.007511 0.001038 0.001258 0.002998 0.000636 8 0.006233

5 rows × 260 columns

12. Multivariate Analysis on byte files

Back to the top

In [ ]:
#multivariate analysis on byte files
#this is with perplexity 50
xtsne=TSNE(perplexity=50)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()
In [ ]:
#this is with perplexity 30
xtsne=TSNE(perplexity=30)
results=xtsne.fit_transform(result.drop(['ID','Class'], axis=1))
vis_x = results[:, 0]
vis_y = results[:, 1]
plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
plt.colorbar(ticks=range(10))
plt.clim(0.5, 9)
plt.show()

13. Train Test split of only Byte Files Features

Back to the top

In [ ]:
data_y = result['Class']
# split the data into test and train by maintaining same distribution of output varaible 'y_true' [stratify=y_true]
X_train, X_test, y_train, y_test = train_test_split(result.drop(['ID','Class'], axis=1), data_y,stratify=data_y,test_size=0.20)
# split the train data into train and cross validation by maintaining same distribution of output varaible 'y_train' [stratify=y_train]
X_train, X_cv, y_train, y_cv = train_test_split(X_train, y_train,stratify=y_train,test_size=0.20)
In [ ]:
print('Number of data points in train data:', X_train.shape[0])
print('Number of data points in test data:', X_test.shape[0])
print('Number of data points in cross validation data:', X_cv.shape[0])
Number of data points in train data: 6955
Number of data points in test data: 2174
Number of data points in cross validation data: 1739
In [ ]:
# it returns a dict, keys as class labels and values as the number of data points in that class
train_class_distribution = y_train.value_counts().sortlevel()
test_class_distribution = y_test.value_counts().sortlevel()
cv_class_distribution = y_cv.value_counts().sortlevel()

my_colors = 'rgbkymc'
train_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in train data')
plt.grid()
plt.show()

# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-train_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',train_class_distribution.values[i], '(', np.round((train_class_distribution.values[i]/y_train.shape[0]*100), 3), '%)')

    
print('-'*80)
my_colors = 'rgbkymc'
test_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in test data')
plt.grid()
plt.show()

# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-test_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',test_class_distribution.values[i], '(', np.round((test_class_distribution.values[i]/y_test.shape[0]*100), 3), '%)')

print('-'*80)
my_colors = 'rgbkymc'
cv_class_distribution.plot(kind='bar', color=my_colors)
plt.xlabel('Class')
plt.ylabel('Data points per Class')
plt.title('Distribution of yi in cross validation data')
plt.grid()
plt.show()

# ref: argsort https://docs.scipy.org/doc/numpy/reference/generated/numpy.argsort.html
# -(train_class_distribution.values): the minus sign will give us in decreasing order
sorted_yi = np.argsort(-train_class_distribution.values)
for i in sorted_yi:
    print('Number of data points in class', i+1, ':',cv_class_distribution.values[i], '(', np.round((cv_class_distribution.values[i]/y_cv.shape[0]*100), 3), '%)')
Number of data points in class 3 : 1883 ( 27.074 %)
Number of data points in class 2 : 1586 ( 22.804 %)
Number of data points in class 1 : 986 ( 14.177 %)
Number of data points in class 8 : 786 ( 11.301 %)
Number of data points in class 9 : 648 ( 9.317 %)
Number of data points in class 6 : 481 ( 6.916 %)
Number of data points in class 4 : 304 ( 4.371 %)
Number of data points in class 7 : 254 ( 3.652 %)
Number of data points in class 5 : 27 ( 0.388 %)
--------------------------------------------------------------------------------
Number of data points in class 3 : 588 ( 27.047 %)
Number of data points in class 2 : 496 ( 22.815 %)
Number of data points in class 1 : 308 ( 14.167 %)
Number of data points in class 8 : 246 ( 11.316 %)
Number of data points in class 9 : 203 ( 9.338 %)
Number of data points in class 6 : 150 ( 6.9 %)
Number of data points in class 4 : 95 ( 4.37 %)
Number of data points in class 7 : 80 ( 3.68 %)
Number of data points in class 5 : 8 ( 0.368 %)
--------------------------------------------------------------------------------
Number of data points in class 3 : 471 ( 27.085 %)
Number of data points in class 2 : 396 ( 22.772 %)
Number of data points in class 1 : 247 ( 14.204 %)
Number of data points in class 8 : 196 ( 11.271 %)
Number of data points in class 9 : 162 ( 9.316 %)
Number of data points in class 6 : 120 ( 6.901 %)
Number of data points in class 4 : 76 ( 4.37 %)
Number of data points in class 7 : 64 ( 3.68 %)
Number of data points in class 5 : 7 ( 0.403 %)
In [ ]:
def plot_confusion_matrix(test_y, predict_y):
    C = confusion_matrix(test_y, predict_y)
    print("Number of misclassified points ",(len(test_y)-np.trace(C))/len(test_y)*100)
    # C = 9,9 matrix, each cell (i,j) represents number of points of class i are predicted class j
    
    A =(((C.T)/(C.sum(axis=1))).T)
    #divid each element of the confusion matrix with the sum of elements in that column
    
    # C = [[1, 2],
    #     [3, 4]]
    # C.T = [[1, 3],
    #        [2, 4]]
    # C.sum(axis = 1)  axis=0 corresonds to columns and axis=1 corresponds to rows in two diamensional array
    # C.sum(axix =1) = [[3, 7]]
    # ((C.T)/(C.sum(axis=1))) = [[1/3, 3/7]
    #                           [2/3, 4/7]]

    # ((C.T)/(C.sum(axis=1))).T = [[1/3, 2/3]
    #                           [3/7, 4/7]]
    # sum of row elements = 1
    
    B =(C/C.sum(axis=0))
    #divid each element of the confusion matrix with the sum of elements in that row
    # C = [[1, 2],
    #     [3, 4]]
    # C.sum(axis = 0)  axis=0 corresonds to columns and axis=1 corresponds to rows in two diamensional array
    # C.sum(axix =0) = [[4, 6]]
    # (C/C.sum(axis=0)) = [[1/4, 2/6],
    #                      [3/4, 4/6]] 
    
    labels = [1,2,3,4,5,6,7,8,9]
    cmap=sns.light_palette("green")
    # representing A in heatmap format
    print("-"*50, "Confusion matrix", "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(C, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()

    print("-"*50, "Precision matrix", "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(B, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    print("Sum of columns in precision matrix",B.sum(axis=0))
    
    # representing B in heatmap format
    print("-"*50, "Recall matrix"    , "-"*50)
    plt.figure(figsize=(10,5))
    sns.heatmap(A, annot=True, cmap=cmap, fmt=".3f", xticklabels=labels, yticklabels=labels)
    plt.xlabel('Predicted Class')
    plt.ylabel('Original Class')
    plt.show()
    print("Sum of rows in precision matrix",A.sum(axis=1))

14. Random Model ONLY on bytes files

Back to the top

In [ ]:
# we need to generate 9 numbers and the sum of numbers should be 1
# one solution is to genarate 9 numbers and divide each of the numbers by their sum
# ref: https://stackoverflow.com/a/18662466/4084039

test_data_len = X_test.shape[0]
cv_data_len = X_cv.shape[0]

# we create a output array that has exactly same size as the CV data
cv_predicted_y = np.zeros((cv_data_len,9))
for i in range(cv_data_len):
    rand_probs = np.random.rand(1,9)
    cv_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Cross Validation Data using Random Model",log_loss(y_cv,cv_predicted_y, eps=1e-15))


# Test-Set error.
#we create a output array that has exactly same as the test data
test_predicted_y = np.zeros((test_data_len,9))
for i in range(test_data_len):
    rand_probs = np.random.rand(1,9)
    test_predicted_y[i] = ((rand_probs/sum(sum(rand_probs)))[0])
print("Log loss on Test Data using Random Model",log_loss(y_test,test_predicted_y, eps=1e-15))

predicted_y =np.argmax(test_predicted_y, axis=1)
plot_confusion_matrix(y_test, predicted_y+1)
Log loss on Cross Validation Data using Random Model 2.45615644965
Log loss on Test Data using Random Model 2.48503905509
Number of misclassified points  88.5004599816
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

15. K Nearest Neighbour Classification ONLY on bytes files

Back to the top

In [ ]:
# find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
# -------------------------
# default parameter
# KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, 
# metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)

# methods of
# fit(X, y) : Fit the model using X as training data and y as target values
# predict(X):Predict the class labels for the provided data
# predict_proba(X):Return probability estimates for the test data X.

# find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
# ----------------------------
# default paramters
# sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method=’sigmoid’, cv=3)
#
# some of the methods of CalibratedClassifierCV()
# fit(X, y[, sample_weight])	Fit the calibrated model
# get_params([deep])	Get parameters for this estimator.
# predict(X)	Predict the target of new samples.
# predict_proba(X)	Posterior probabilities of classification

  
alpha = [x for x in range(1, 15, 2)]
cv_log_error_array=[]
for i in alpha:
    k_clf=KNeighborsClassifier(n_neighbors=i)
    k_clf.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_cv)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=k_clf.classes_, eps=1e-15))
    
for i in range(len(cv_log_error_array)):
    print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])

best_alpha = np.argmin(cv_log_error_array)
    
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

k_clf=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
k_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
    
predict_y = sig_clf.predict_proba(X_train)
print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
log_loss for k =  1 is 0.225386237304
log_loss for k =  3 is 0.230795229168
log_loss for k =  5 is 0.252421408646
log_loss for k =  7 is 0.273827486888
log_loss for k =  9 is 0.286469181555
log_loss for k =  11 is 0.29623391147
log_loss for k =  13 is 0.307551203154
For values of best alpha =  1 The train log loss is: 0.0782947669247
For values of best alpha =  1 The cross validation log loss is: 0.225386237304
For values of best alpha =  1 The test log loss is: 0.241508604195
Number of misclassified points  4.50781968721
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

16. Logistic Regression ONLY on bytes files

Back to the top

In [ ]:
# read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
# ------------------------------
# default parameters
# SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, 
# shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5, 
# class_weight=None, warm_start=False, average=False, n_iter=None)

# some of methods
# fit(X, y[, coef_init, intercept_init, …])	Fit linear model with Stochastic Gradient Descent.
# predict(X)	Predict class labels for samples in X.


alpha = [10 ** x for x in range(-5, 4)]
cv_log_error_array=[]
for i in alpha:
    logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
    logisticR.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_cv)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=logisticR.classes_, eps=1e-15))
    
for i in range(len(cv_log_error_array)):
    print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])

best_alpha = np.argmin(cv_log_error_array)
    
fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
logisticR.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
sig_clf.fit(X_train, y_train)
pred_y=sig_clf.predict(X_test)

predict_y = sig_clf.predict_proba(X_train)
print ('log loss for train data',log_loss(y_train, predict_y, labels=logisticR.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_cv)
print ('log loss for cv data',log_loss(y_cv, predict_y, labels=logisticR.classes_, eps=1e-15))
predict_y = sig_clf.predict_proba(X_test)
print ('log loss for test data',log_loss(y_test, predict_y, labels=logisticR.classes_, eps=1e-15))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
log_loss for c =  1e-05 is 1.56916911178
log_loss for c =  0.0001 is 1.57336384417
log_loss for c =  0.001 is 1.53598598273
log_loss for c =  0.01 is 1.01720972418
log_loss for c =  0.1 is 0.857766083873
log_loss for c =  1 is 0.711154393309
log_loss for c =  10 is 0.583929522635
log_loss for c =  100 is 0.549929846589
log_loss for c =  1000 is 0.624746769121
log loss for train data 0.498923428696
log loss for cv data 0.549929846589
log loss for test data 0.528347316704
Number of misclassified points  12.3275068997
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [  1.   1.   1.   1.  nan   1.   1.   1.   1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

17. Random Forest Classifier ONLY on bytes files

Back to the top

In [ ]:
# --------------------------------
# default parameters 
# sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion=’gini’, max_depth=None, min_samples_split=2, 
# min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, 
# min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, 
# class_weight=None)

# Some of methods of RandomForestClassifier()
# fit(X, y, [sample_weight])	Fit the SVM model according to the given training data.
# predict(X)	Perform classification on samples in X.
# predict_proba (X)	Perform classification on samples in X.

# some of attributes of  RandomForestClassifier()
# feature_importances_ : array of shape = [n_features]
# The feature importances (the higher, the more important the feature).

alpha=[10,50,100,500,1000,2000,3000]
cv_log_error_array=[]
train_log_error_array=[]
from sklearn.ensemble import RandomForestClassifier
for i in alpha:
    r_clf=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
    r_clf.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_cv)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=r_clf.classes_, eps=1e-15))

for i in range(len(cv_log_error_array)):
    print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])


best_alpha = np.argmin(cv_log_error_array)

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()


r_clf=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
r_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)

predict_y = sig_clf.predict_proba(X_train)
print('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
log_loss for c =  10 is 0.106357709164
log_loss for c =  50 is 0.0902124124145
log_loss for c =  100 is 0.0895043339776
log_loss for c =  500 is 0.0881420869288
log_loss for c =  1000 is 0.0879849524621
log_loss for c =  2000 is 0.0881566647295
log_loss for c =  3000 is 0.0881318948443
For values of best alpha =  1000 The train log loss is: 0.0266476291801
For values of best alpha =  1000 The cross validation log loss is: 0.0879849524621
For values of best alpha =  1000 The test log loss is: 0.0858346961407
Number of misclassified points  2.02391904324
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

18. XgBoost Classification ONLY on bytes files

In [ ]:
# Training a hyper-parameter tuned Xg-Boost regressor on our train data

# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default paramters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, 
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, 
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, 
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)

# some of methods of RandomForestRegressor()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep])	Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance


alpha=[10,50,100,500,1000,2000]
cv_log_error_array=[]
for i in alpha:
    x_clf=XGBClassifier(n_estimators=i,nthread=-1)
    x_clf.fit(X_train,y_train)
    sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
    sig_clf.fit(X_train, y_train)
    predict_y = sig_clf.predict_proba(X_cv)
    cv_log_error_array.append(log_loss(y_cv, predict_y, labels=x_clf.classes_, eps=1e-15))

for i in range(len(cv_log_error_array)):
    print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])


best_alpha = np.argmin(cv_log_error_array)

fig, ax = plt.subplots()
ax.plot(alpha, cv_log_error_array,c='g')
for i, txt in enumerate(np.round(cv_log_error_array,3)):
    ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
plt.grid()
plt.title("Cross Validation Error for each alpha")
plt.xlabel("Alpha i's")
plt.ylabel("Error measure")
plt.show()

x_clf=XGBClassifier(n_estimators=alpha[best_alpha],nthread=-1)
x_clf.fit(X_train,y_train)
sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
sig_clf.fit(X_train, y_train)
    
predict_y = sig_clf.predict_proba(X_train)
print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train, predict_y))
predict_y = sig_clf.predict_proba(X_cv)
print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv, predict_y))
predict_y = sig_clf.predict_proba(X_test)
print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test, predict_y))
plot_confusion_matrix(y_test, sig_clf.predict(X_test))
log_loss for c =  10 is 0.20615980494
log_loss for c =  50 is 0.123888382365
log_loss for c =  100 is 0.099919437112
log_loss for c =  500 is 0.0931035681289
log_loss for c =  1000 is 0.0933084876012
log_loss for c =  2000 is 0.0938395690309
For values of best alpha =  500 The train log loss is: 0.0225231805824
For values of best alpha =  500 The cross validation log loss is: 0.0931035681289
For values of best alpha =  500 The test log loss is: 0.0792067651731
Number of misclassified points  1.24195032199
-------------------------------------------------- Confusion matrix --------------------------------------------------
-------------------------------------------------- Precision matrix --------------------------------------------------
Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
-------------------------------------------------- Recall matrix --------------------------------------------------
Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]

19. XgBoost Classification with best hyper parameters using RandomSearch ONLY on bytes files

Back to the top

In [ ]:
# https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python/
x_clf=XGBClassifier()

prams={
    'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
     'n_estimators':[100,200,500,1000,2000],
     'max_depth':[3,5,10],
    'colsample_bytree':[0.1,0.3,0.5,1],
    'subsample':[0.1,0.3,0.5,1]
}
random_clf1=RandomizedSearchCV(x_clf,param_distributions=prams,verbose=10,n_jobs=-1,)
random_clf1.fit(X_train,y_train)
Fitting 3 folds for each of 10 candidates, totalling 30 fits
[Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:   26.5s
[Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:  5.8min
[Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:  9.3min remaining:  5.4min
[Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed: 10.1min remaining:  3.1min
[Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed: 14.0min remaining:  1.6min
[Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed: 14.2min finished
Out[ ]:
RandomizedSearchCV(cv=None, error_score='raise',
          estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
       gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='binary:logistic', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
          fit_params=None, iid=True, n_iter=10, n_jobs=-1,
          param_distributions={'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2], 'n_estimators': [100, 200, 500, 1000, 2000], 'max_depth': [3, 5, 10], 'colsample_bytree': [0.1, 0.3, 0.5, 1], 'subsample': [0.1, 0.3, 0.5, 1]},
          pre_dispatch='2*n_jobs', random_state=None, refit=True,
          return_train_score=True, scoring=None, verbose=10)
In [ ]:
print (random_clf1.best_params_)
{'subsample': 1, 'n_estimators': 500, 'max_depth': 5, 'learning_rate': 0.05, 'colsample_bytree': 0.5}
In [ ]:
# Training a hyper-parameter tuned Xg-Boost regressor on our train data

# find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
# -------------------------
# default paramters
# class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, 
# objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, 
# max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, 
# scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)

# some of methods of RandomForestRegressor()
# fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
# get_params([deep])	Get parameters for this estimator.
# predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
# get_score(importance_type='weight') -> get the feature importance


x_clf=XGBClassifier(n_estimators=2000, learning_rate=0.05, colsample_bytree=1, max_depth=3)
x_clf.fit(X_train,y_train)
c_cfl=CalibratedClassifierCV(x_clf,method='sigmoid')
c_cfl.fit(X_train,y_train)

predict_y = c_cfl.predict_proba(X_train)
print ('train loss',log_loss(y_train, predict_y))
predict_y = c_cfl.predict_proba(X_cv)
print ('cv loss',log_loss(y_cv, predict_y))
predict_y = c_cfl.predict_proba(X_test)
print ('test loss',log_loss(y_test, predict_y))
train loss 0.022540976086
cv loss 0.0928710624158
test loss 0.0782688587098

20. Modeling with .asm files

Back to the top

There are 10868 files of asm All the files make up about 150 GB The asm files contains :

  1. Address
  2. Segments
  3. Opcodes
  4. Registers
  5. function calls
  6. APIs With the help of parallel processing we extracted all the features.In parallel we can use all the cores that are present in our computer.

Here we extracted 52 features from all the asm files which are important.

We read the top solutions and handpicked the features from those papers/videos/blogs.
Refer:https://www.kaggle.com/c/malware-classification/discussion

A note on opcode

“opcode” is short for operational code. These are the bytes stored in memory that the computer actually runs.

Here you will see all the opcodes that a processor supports. An assembler basically takes text and does some relatively simple conversion of it into a file of opcodes that the computer can read and run directly. Most assemble translate very directly to opcodes. And often the 3 to 5 character assembler names for the opcodes are called opcodes. Very technically the opcodes are the binary numbers stored in memory. The names for them in assembler are the opcode names. Also technically not all the binary numbers stored are the opcodes. The opcodes tell the CPU what to do. Often the number right after the opcodes are parameters for the opcode instructions.

Also Check this very complete table of x86 opcodes on x86asm.net and this as well.

There is also asmjit/asmdb project, which provides public domain X86/X64 database in a JSON-like format

Referring this Paper

In this thesis they used reverse engineering to extract the assembly instructions from a given executable file and chose to use only the opcodes, which are the part of the instruction that specifies the operation to be performed, an example "mov".

By performing statistical analysis on the datasets, a significant difference between the opcodes in malware and benign files was found. Due to this, supervised and unsupervised machine learning approaches like artificial neural network, support vector machine, bayes net, random forest, k nearest neighbours, and self organizing map was used to look at the sequences of these instructions. The unknown files were classified as either malware or benign depending on the presence of, and number of occurrences of different sequences. We show that by using only opcodes without operands (the rest of the instruction), malware can be distinguished from benign files. By using a sequence length of up to four opcodes, a classification accuracy of 95,58% was achieved.

21. Bag of Words Feature extraction from asm files

Back to the top

What we do here is - A. Collect the most important 52 keywords in the 4 variables named 'prefixes', 'opcodes' 'keywords' and 'registers' and then B. Calculate Bag of Words on the whole .asm file dataset based on these 52 Keywords. e.g. Given the word 'Header' is a keyword, we calculate how many times in a given .asm file, I see the word 'HEADER' or the Frequency of occurrences of the word 'Header' in a given .asm file.

  • To extract the unigram features from the .asm files we need to process ~150GB of data
  • Note: Below two cells will take lot of time (over 48 hours to complete)
  • We will provide you the output file of these two cells, which you can directly use it
  • The below 2 blocks of codes (for running Bag of Words or CountVectorizer on the 150GB asm files while handling it parallelly using multiple-cores of our machine) are taken from here.

    In the first block, I am doing the parallelization to read 150GB of data to be read with a chunk size of 30GB in each parallel

    • So first, create 5 folders
    • In each of these folders put 30GB of data
    • So now my machine's multi-core processor can read in parallel the 150gb data simultaneously reading it from the 5 different folders.
    In [ ]:
    # This code taken from https://github.com/kunwar-vikrant/Microsoft-Malware-Detection
    #intially create five folders
    #first 
    #second
    #thrid
    #fourth
    #fifth
    #this code tells us about random split of files into five folders
    folder_1 ='first'
    folder_2 ='second'
    folder_3 ='third'
    folder_4 ='fourth'
    folder_5 ='fifth'
    folder_6 = 'output'
    for i in [folder_1,folder_2,folder_3,folder_4,folder_5,folder_6]:
        if not os.path.isdir(i):
            os.makedirs(i)
    
    source='train/'
    files = os.listdir('train')
    ID=df['Id'].tolist()
    data=range(0,10868)
    r.shuffle(data)
    count=0
    for i in range(0,10868):
        if i % 5==0:
            shutil.move(source+files[data[i]],'first')
        elif i%5==1:
            shutil.move(source+files[data[i]],'second')
        elif i%5 ==2:
            shutil.move(source+files[data[i]],'thrid')
        elif i%5 ==3:
            shutil.move(source+files[data[i]],'fourth')
        elif i%5==4:
            shutil.move(source+files[data[i]],'fifth')
    

    And in the second block below what we are doing is

    • First collect the most important 52 keywords in the 4 variables named 'prefixes', 'opcodes' 'keywords' and 'registers'

    • Then run a Bag of Words on these 52 keywords, e.g. how many times I see the word 'HEADER' which is under the variable 'prefixes'

    • So this code below is just a custom implementation of the simple CountVectorizer() function of scikit-learn. But we could not use scikit-learn directly here as it will not support 150GB of data.

    In [ ]:
    # http://flint.cs.yale.edu/cs421/papers/x86-asm/asm.html
    
    def firstprocess():
        #The prefixes tells about the segments that are present in the asm files
        #There are 450 segments(approx) present in all asm files.
        #this prefixes are best segments that gives us best values.
        #https://en.wikipedia.org/wiki/Data_segment
        
        prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
        #this are opcodes that are used to get best results
        #https://en.wikipedia.org/wiki/X86_instruction_listings
        
        opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
        #best keywords that are taken from different blogs
        keywords = ['.dll','std::',':dword']
        #Below taken registers are general purpose registers and special registers
        #All the registers which are taken are best 
        registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
        file1=open("output\asmsmallfile.txt","w+")
        files = os.listdir('first')
        for f in files:
            #filling the values with zeros into the arrays
            prefixescount=np.zeros(len(prefixes),dtype=int)
            opcodescount=np.zeros(len(opcodes),dtype=int)
            keywordcount=np.zeros(len(keywords),dtype=int)
            registerscount=np.zeros(len(registers),dtype=int)
            features=[]
            f2=f.split('.')[0]
            file1.write(f2+",")
            opcodefile.write(f2+" ")
            # https://docs.python.org/3/library/codecs.html#codecs.ignore_errors
            # https://docs.python.org/3/library/codecs.html#codecs.Codec.encode
            with codecs.open('first/'+f,encoding='cp1252',errors ='replace') as fli:
                for lines in fli:
                    # https://www.tutorialspoint.com/python3/string_rstrip.htm
                    line=lines.rstrip().split()
                    l=line[0]
                    #counting the prefixs in each and every line
                    for i in range(len(prefixes)):
                        if prefixes[i] in line[0]:
                            prefixescount[i]+=1
                    line=line[1:]
                    #counting the opcodes in each and every line
                    for i in range(len(opcodes)):
                        if any(opcodes[i]==li for li in line):
                            features.append(opcodes[i])
                            opcodescount[i]+=1
                    #counting registers in the line
                    for i in range(len(registers)):
                        for li in line:
                            # we will use registers only in 'text' and 'CODE' segments
                            if registers[i] in li and ('text' in l or 'CODE' in l):
                                registerscount[i]+=1
                    #counting keywords in the line
                    for i in range(len(keywords)):
                        for li in line:
                            if keywords[i] in li:
                                keywordcount[i]+=1
            #pushing the values into the file after reading whole file
            for prefix in prefixescount:
                file1.write(str(prefix)+",")
            for opcode in opcodescount:
                file1.write(str(opcode)+",")
            for register in registerscount:
                file1.write(str(register)+",")
            for key in keywordcount:
                file1.write(str(key)+",")
            file1.write("\n")
        file1.close()
    
    
    #same as above 
    def secondprocess():
        prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
        opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
        keywords = ['.dll','std::',':dword']
        registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
        file1=open("output\mediumasmfile.txt","w+")
        files = os.listdir('second')
        for f in files:
            prefixescount=np.zeros(len(prefixes),dtype=int)
            opcodescount=np.zeros(len(opcodes),dtype=int)
            keywordcount=np.zeros(len(keywords),dtype=int)
            registerscount=np.zeros(len(registers),dtype=int)
            features=[]
            f2=f.split('.')[0]
            file1.write(f2+",")
            opcodefile.write(f2+" ")
            with codecs.open('second/'+f,encoding='cp1252',errors ='replace') as fli:
                for lines in fli:
                    line=lines.rstrip().split()
                    l=line[0]
                    for i in range(len(prefixes)):
                        if prefixes[i] in line[0]:
                            prefixescount[i]+=1
                    line=line[1:]
                    for i in range(len(opcodes)):
                        if any(opcodes[i]==li for li in line):
                            features.append(opcodes[i])
                            opcodescount[i]+=1
                    for i in range(len(registers)):
                        for li in line:
                            if registers[i] in li and ('text' in l or 'CODE' in l):
                                registerscount[i]+=1
                    for i in range(len(keywords)):
                        for li in line:
                            if keywords[i] in li:
                                keywordcount[i]+=1
            for prefix in prefixescount:
                file1.write(str(prefix)+",")
            for opcode in opcodescount:
                file1.write(str(opcode)+",")
            for register in registerscount:
                file1.write(str(register)+",")
            for key in keywordcount:
                file1.write(str(key)+",")
            file1.write("\n")
        file1.close()
    
    # same as smallprocess() functions
    def thirdprocess():
        prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
        opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
        keywords = ['.dll','std::',':dword']
        registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
        file1=open("output\largeasmfile.txt","w+")
        files = os.listdir('thrid')
        for f in files:
            prefixescount=np.zeros(len(prefixes),dtype=int)
            opcodescount=np.zeros(len(opcodes),dtype=int)
            keywordcount=np.zeros(len(keywords),dtype=int)
            registerscount=np.zeros(len(registers),dtype=int)
            features=[]
            f2=f.split('.')[0]
            file1.write(f2+",")
            opcodefile.write(f2+" ")
            with codecs.open('thrid/'+f,encoding='cp1252',errors ='replace') as fli:
                for lines in fli:
                    line=lines.rstrip().split()
                    l=line[0]
                    for i in range(len(prefixes)):
                        if prefixes[i] in line[0]:
                            prefixescount[i]+=1
                    line=line[1:]
                    for i in range(len(opcodes)):
                        if any(opcodes[i]==li for li in line):
                            features.append(opcodes[i])
                            opcodescount[i]+=1
                    for i in range(len(registers)):
                        for li in line:
                            if registers[i] in li and ('text' in l or 'CODE' in l):
                                registerscount[i]+=1
                    for i in range(len(keywords)):
                        for li in line:
                            if keywords[i] in li:
                                keywordcount[i]+=1
            for prefix in prefixescount:
                file1.write(str(prefix)+",")
            for opcode in opcodescount:
                file1.write(str(opcode)+",")
            for register in registerscount:
                file1.write(str(register)+",")
            for key in keywordcount:
                file1.write(str(key)+",")
            file1.write("\n")
        file1.close()
    
    
    def fourthprocess():
        prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
        opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
        keywords = ['.dll','std::',':dword']
        registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
        file1=open("output\hugeasmfile.txt","w+")
        files = os.listdir('fourth/')
        for f in files:
            prefixescount=np.zeros(len(prefixes),dtype=int)
            opcodescount=np.zeros(len(opcodes),dtype=int)
            keywordcount=np.zeros(len(keywords),dtype=int)
            registerscount=np.zeros(len(registers),dtype=int)
            features=[]
            f2=f.split('.')[0]
            file1.write(f2+",")
            opcodefile.write(f2+" ")
            with codecs.open('fourth/'+f,encoding='cp1252',errors ='replace') as fli:
                for lines in fli:
                    line=lines.rstrip().split()
                    l=line[0]
                    for i in range(len(prefixes)):
                        if prefixes[i] in line[0]:
                            prefixescount[i]+=1
                    line=line[1:]
                    for i in range(len(opcodes)):
                        if any(opcodes[i]==li for li in line):
                            features.append(opcodes[i])
                            opcodescount[i]+=1
                    for i in range(len(registers)):
                        for li in line:
                            if registers[i] in li and ('text' in l or 'CODE' in l):
                                registerscount[i]+=1
                    for i in range(len(keywords)):
                        for li in line:
                            if keywords[i] in li:
                                keywordcount[i]+=1
            for prefix in prefixescount:
                file1.write(str(prefix)+",")
            for opcode in opcodescount:
                file1.write(str(opcode)+",")
            for register in registerscount:
                file1.write(str(register)+",")
            for key in keywordcount:
                file1.write(str(key)+",")
            file1.write("\n")
        file1.close()
    
    
    def fifthprocess():
        prefixes = ['HEADER:','.text:','.Pav:','.idata:','.data:','.bss:','.rdata:','.edata:','.rsrc:','.tls:','.reloc:','.BSS:','.CODE']
        opcodes = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
        keywords = ['.dll','std::',':dword']
        registers=['edx','esi','eax','ebx','ecx','edi','ebp','esp','eip']
        file1=open("output\trainasmfile.txt","w+")
        files = os.listdir('fifth/')
        for f in files:
            prefixescount=np.zeros(len(prefixes),dtype=int)
            opcodescount=np.zeros(len(opcodes),dtype=int)
            keywordcount=np.zeros(len(keywords),dtype=int)
            registerscount=np.zeros(len(registers),dtype=int)
            features=[]
            f2=f.split('.')[0]
            file1.write(f2+",")
            opcodefile.write(f2+" ")
            with codecs.open('fifth/'+f,encoding='cp1252',errors ='replace') as fli:
                for lines in fli:
                    line=lines.rstrip().split()
                    l=line[0]
                    for i in range(len(prefixes)):
                        if prefixes[i] in line[0]:
                            prefixescount[i]+=1
                    line=line[1:]
                    for i in range(len(opcodes)):
                        if any(opcodes[i]==li for li in line):
                            features.append(opcodes[i])
                            opcodescount[i]+=1
                    for i in range(len(registers)):
                        for li in line:
                            if registers[i] in li and ('text' in l or 'CODE' in l):
                                registerscount[i]+=1
                    for i in range(len(keywords)):
                        for li in line:
                            if keywords[i] in li:
                                keywordcount[i]+=1
            for prefix in prefixescount:
                file1.write(str(prefix)+",")
            for opcode in opcodescount:
                file1.write(str(opcode)+",")
            for register in registerscount:
                file1.write(str(register)+",")
            for key in keywordcount:
                file1.write(str(key)+",")
            file1.write("\n")
        file1.close()
    
    
    def main():
        #the below code is used for multiprogramming
        #the number of process depends upon the number of cores present System
        #process is used to call multiprogramming
        manager=multiprocessing.Manager() 	
        p1=Process(target=firstprocess)
        p2=Process(target=secondprocess)
        p3=Process(target=thirdprocess)
        p4=Process(target=fourthprocess)
        p5=Process(target=fifthprocess)
        #p1.start() is used to start the thread execution
        p1.start()
        p2.start()
        p3.start()
        p4.start()
        p5.start()
        #After completion all the threads are joined
        p1.join()
        p2.join()
        p3.join()
        p4.join()
        p5.join()
    
    if __name__=="__main__":
        main()
    
    In [ ]:
    # asmoutputfile.csv(output genarated from the above two cells) will contain all the extracted features from .asm files
    # we will use this file directly
    dfasm=pd.read_csv("asmoutputfile.csv")
    Y.columns = ['ID', 'Class']
    result_asm = pd.merge(dfasm, Y,on='ID', how='left')
    result_asm.head()
    
    Out[ ]:
    ID HEADER: .text: .Pav: .idata: .data: .bss: .rdata: .edata: .rsrc: ... edx esi eax ebx ecx edi ebp esp eip Class
    0 01kcPWA9K2BOxQeS5Rju 19 744 0 127 57 0 323 0 3 ... 18 66 15 43 83 0 17 48 29 1
    1 1E93CpP60RHFNiT5Qfvn 17 838 0 103 49 0 0 0 3 ... 18 29 48 82 12 0 14 0 20 1
    2 3ekVow2ajZHbTnBcsDfX 17 427 0 50 43 0 145 0 3 ... 13 42 10 67 14 0 11 0 9 1
    3 3X2nY7iQaPBIWDrAZqJe 17 227 0 43 19 0 0 0 3 ... 6 8 14 7 2 0 8 0 6 1
    4 46OZzdsSKDCFV8h7XWxf 17 402 0 59 170 0 0 0 3 ... 12 9 18 29 5 0 11 0 11 1

    5 rows × 53 columns

    22. Files sizes of each .asm file as a feature

    Back to the top

    In [ ]:
    # file sizes of asm files
    
    files=os.listdir('asmFiles')
    filenames=Y['ID'].tolist()
    class_y=Y['Class'].tolist()
    class_bytes=[]
    sizebytes=[]
    fnames=[]
    for file in files:
        # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
        # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0, 
        # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
        # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
        statinfo=os.stat('asmFiles/'+file)
        # split the file name at '.' and take the first part of it i.e the file name
        file=file.split('.')[0]
        if any(file == filename for filename in filenames):
            i=filenames.index(file)
            class_bytes.append(class_y[i])
            # converting into Mb's
            sizebytes.append(statinfo.st_size/(1024.0*1024.0))
            fnames.append(file)
    asm_size_byte=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
    print (asm_size_byte.head())
    
       Class                    ID       size
    0      9  01azqd4InC7m9JpocGv5  56.229886
    1      2  01IsoiSMh5gxyDYTl4CB  13.999378
    2      9  01jsnpXSAlgw6aPeDxrU   8.507785
    3      1  01kcPWA9K2BOxQeS5Rju   0.078190
    4      8  01SuzwMJEIXsK7A8dQbl   0.996723
    

    4.2.1.2 Distribution of .asm file sizes

    In [ ]:
    #boxplot of asm files
    ax = sns.boxplot(x="Class", y="size", data=asm_size_byte)
    plt.title("boxplot of .bytes file sizes")
    plt.show()
    
    In [ ]:
    # add the file size feature to previous extracted features
    print(result_asm.shape)
    print(asm_size_byte.shape)
    result_asm = pd.merge(result_asm, asm_size_byte.drop(['Class'], axis=1),on='ID', how='left')
    result_asm.head()
    
    (10868, 53)
    (10868, 3)
    
    Out[ ]:
    ID HEADER: .text: .Pav: .idata: .data: .bss: .rdata: .edata: .rsrc: ... esi eax ebx ecx edi ebp esp eip Class size
    0 01kcPWA9K2BOxQeS5Rju 19 744 0 127 57 0 323 0 3 ... 66 15 43 83 0 17 48 29 1 0.078190
    1 1E93CpP60RHFNiT5Qfvn 17 838 0 103 49 0 0 0 3 ... 29 48 82 12 0 14 0 20 1 0.063400
    2 3ekVow2ajZHbTnBcsDfX 17 427 0 50 43 0 145 0 3 ... 42 10 67 14 0 11 0 9 1 0.041695
    3 3X2nY7iQaPBIWDrAZqJe 17 227 0 43 19 0 0 0 3 ... 8 14 7 2 0 8 0 6 1 0.018757
    4 46OZzdsSKDCFV8h7XWxf 17 402 0 59 170 0 0 0 3 ... 9 18 29 5 0 11 0 11 1 0.037567

    5 rows × 54 columns

    In [ ]:
    # we normalize the data each column 
    result_asm = normalize(result_asm)
    result_asm.head()
    
    Out[ ]:
    ID HEADER: .text: .Pav: .idata: .data: .bss: .rdata: .edata: .rsrc: ... esi eax ebx ecx edi ebp esp eip Class size
    0 01kcPWA9K2BOxQeS5Rju 0.107345 0.001092 0.0 0.000761 0.000023 0.0 0.000084 0.0 0.000072 ... 0.000746 0.000301 0.000360 0.001057 0.0 0.030797 0.001468 0.003173 1 0.000432
    1 1E93CpP60RHFNiT5Qfvn 0.096045 0.001230 0.0 0.000617 0.000019 0.0 0.000000 0.0 0.000072 ... 0.000328 0.000965 0.000686 0.000153 0.0 0.025362 0.000000 0.002188 1 0.000327
    2 3ekVow2ajZHbTnBcsDfX 0.096045 0.000627 0.0 0.000300 0.000017 0.0 0.000038 0.0 0.000072 ... 0.000475 0.000201 0.000560 0.000178 0.0 0.019928 0.000000 0.000985 1 0.000172
    3 3X2nY7iQaPBIWDrAZqJe 0.096045 0.000333 0.0 0.000258 0.000008 0.0 0.000000 0.0 0.000072 ... 0.000090 0.000281 0.000059 0.000025 0.0 0.014493 0.000000 0.000657 1 0.000009
    4 46OZzdsSKDCFV8h7XWxf 0.096045 0.000590 0.0 0.000353 0.000068 0.0 0.000000 0.0 0.000072 ... 0.000102 0.000362 0.000243 0.000064 0.0 0.019928 0.000000 0.001204 1 0.000143

    5 rows × 54 columns

    23. Univariate analysis ONLY on .asm file features

    Back to the top

    In [ ]:
    ax = sns.boxplot(x="Class", y=".text:", data=result_asm)
    plt.title("boxplot of .asm text segment")
    plt.show()
    
    The plot is between Text and class 
    Class 1,2 and 9 can be easly separated
    
    In [ ]:
    ax = sns.boxplot(x="Class", y=".Pav:", data=result_asm)
    plt.title("boxplot of .asm pav segment")
    plt.show()
    
    In [ ]:
    ax = sns.boxplot(x="Class", y=".data:", data=result_asm)
    plt.title("boxplot of .asm data segment")
    plt.show()
    
    The plot is between data segment and class label 
    class 6 and class 9 can be easily separated from given points
    
    In [ ]:
    ax = sns.boxplot(x="Class", y=".bss:", data=result_asm)
    plt.title("boxplot of .asm bss segment")
    plt.show()
    
    plot between bss segment and class label
    very less number of files are having bss segment
    
    In [ ]:
    ax = sns.boxplot(x="Class", y=".rdata:", data=result_asm)
    plt.title("boxplot of .asm rdata segment")
    plt.show()
    

    Imgur

    Plot between rdata segment and Class segment
    Class 2 can be easily separated 75 pecentile files are having 1M rdata lines
    
    In [ ]:
    ax = sns.boxplot(x="Class", y="jmp", data=result_asm)
    plt.title("boxplot of .asm jmp opcode")
    plt.show()
    
    plot between jmp and Class label
    Class 1 is having frequency of 2000 approx in 75 perentile of files
    
    In [ ]:
    ax = sns.boxplot(x="Class", y="mov", data=result_asm)
    plt.title("boxplot of .asm mov opcode")
    plt.show()
    
    plot between Class label and mov opcode
    Class 1 is having frequency of 2000 approx in 75 perentile of files
    
    In [ ]:
    ax = sns.boxplot(x="Class", y="retf", data=result_asm)
    plt.title("boxplot of .asm retf opcode")
    plt.show()
    
    plot between Class label and retf
    Class 6 can be easily separated with opcode retf
    The frequency of retf is approx of 250.
    
    In [ ]:
    ax = sns.boxplot(x="Class", y="push", data=result_asm)
    plt.title("boxplot of .asm push opcode")
    plt.show()
    
    plot between push opcode and Class label
    Class 1 is having 75 precentile files with push opcodes of frequency 1000
    

    24. Multivariate Analysis ONLY on .asm file features

    Back to the top

    In [ ]:
    #multivariate analysis on asm files
    #this is with perplexity 50
    xtsne=TSNE(perplexity=50)
    results=xtsne.fit_transform(result_asm.drop(['ID','Class'], axis=1).fillna(0))
    
    vis_x = results[:, 0]
    vis_y = results[:, 1   ]
    plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
    plt.colorbar(ticks=range(10))
    plt.clim(0.5, 9)
    plt.show()
    
    In [ ]:
    # by univariate analysis on the .asm file features we are getting very negligible information from 
    # 'rtn', '.BSS:' '.CODE' features, so heare we are trying multivariate analysis after removing those features
    # the plot looks very messy
    
    xtsne=TSNE(perplexity=30)
    results=xtsne.fit_transform(result_asm.drop(['ID','Class', 'rtn', '.BSS:', '.CODE','size'], axis=1))
    vis_x = results[:, 0]
    vis_y = results[:, 1]
    plt.scatter(vis_x, vis_y, c=data_y, cmap=plt.cm.get_cmap("jet", 9))
    plt.colorbar(ticks=range(10))
    plt.clim(0.5, 9)
    plt.show()
    
    TSNE for asm data with perplexity 50
    

    25. Conclusion on EDA ( ONLY on .asm file features)

    Back to the top

  • We have taken only 52 features from asm files (after reading through many blogs and research papers)
  • The univariate analysis was done only on few important features.
  • Take-aways
    • 1. Class 3 can be easily separated because of the frequency of segments,opcodes and keywords being less
    • 2. Each feature has its unique importance in separating the Class labels.
  • 26. Train and test split ( ONLY on .asm file featues )

    Back to the top

    In [ ]:
    asm_y = result_asm['Class']
    asm_x = result_asm.drop(['ID','Class','.BSS:','rtn','.CODE'], axis=1)
    
    In [ ]:
    X_train_asm, X_test_asm, y_train_asm, y_test_asm = train_test_split(asm_x,asm_y ,stratify=asm_y,test_size=0.20)
    X_train_asm, X_cv_asm, y_train_asm, y_cv_asm = train_test_split(X_train_asm, y_train_asm,stratify=y_train_asm,test_size=0.20)
    
    In [ ]:
    print( X_cv_asm.isnull().all())
    
    HEADER:    False
    .text:     False
    .Pav:      False
    .idata:    False
    .data:     False
    .bss:      False
    .rdata:    False
    .edata:    False
    .rsrc:     False
    .tls:      False
    .reloc:    False
    jmp        False
    mov        False
    retf       False
    push       False
    pop        False
    xor        False
    retn       False
    nop        False
    sub        False
    inc        False
    dec        False
    add        False
    imul       False
    xchg       False
    or         False
    shr        False
    cmp        False
    call       False
    shl        False
    ror        False
    rol        False
    jnb        False
    jz         False
    lea        False
    movzx      False
    .dll       False
    std::      False
    :dword     False
    edx        False
    esi        False
    eax        False
    ebx        False
    ecx        False
    edi        False
    ebp        False
    esp        False
    eip        False
    size       False
    dtype: bool
    

    27. K-Nearest Neigbors ONLY on .asm file features

    Back to the top

    In [ ]:
    # find more about KNeighborsClassifier() here http://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html
    # -------------------------
    # default parameter
    # KNeighborsClassifier(n_neighbors=5, weights=’uniform’, algorithm=’auto’, leaf_size=30, p=2, 
    # metric=’minkowski’, metric_params=None, n_jobs=1, **kwargs)
    
    # methods of
    # fit(X, y) : Fit the model using X as training data and y as target values
    # predict(X):Predict the class labels for the provided data
    # predict_proba(X):Return probability estimates for the test data X.
    
    
    # find more about CalibratedClassifierCV here at http://scikit-learn.org/stable/modules/generated/sklearn.calibration.CalibratedClassifierCV.html
    # ----------------------------
    # default paramters
    # sklearn.calibration.CalibratedClassifierCV(base_estimator=None, method=’sigmoid’, cv=3)
    #
    # some of the methods of CalibratedClassifierCV()
    # fit(X, y[, sample_weight])	Fit the calibrated model
    # get_params([deep])	Get parameters for this estimator.
    # predict(X)	Predict the target of new samples.
    # predict_proba(X)	Posterior probabilities of classification
    
    
    alpha = [x for x in range(1, 21,2)]
    cv_log_error_array=[]
    for i in alpha:
        k_clf=KNeighborsClassifier(n_neighbors=i)
        k_clf.fit(X_train_asm,y_train_asm)
        sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=k_clf.classes_, eps=1e-15))
        
    for i in range(len(cv_log_error_array)):
        print ('log_loss for k = ',alpha[i],'is',cv_log_error_array[i])
    
    best_alpha = np.argmin(cv_log_error_array)
        
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    
    k_clf=KNeighborsClassifier(n_neighbors=alpha[best_alpha])
    k_clf.fit(X_train_asm,y_train_asm)
    sig_clf = CalibratedClassifierCV(k_clf, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    pred_y=sig_clf.predict(X_test_asm)
    
    
    predict_y = sig_clf.predict_proba(X_train_asm)
    print ('log loss for train data',log_loss(y_train_asm, predict_y))
    predict_y = sig_clf.predict_proba(X_cv_asm)
    print ('log loss for cv data',log_loss(y_cv_asm, predict_y))
    predict_y = sig_clf.predict_proba(X_test_asm)
    print ('log loss for test data',log_loss(y_test_asm, predict_y))
    plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
    
    log_loss for k =  1 is 0.104531321344
    log_loss for k =  3 is 0.0958800580948
    log_loss for k =  5 is 0.0995466557335
    log_loss for k =  7 is 0.107227274345
    log_loss for k =  9 is 0.119239543547
    log_loss for k =  11 is 0.133926642781
    log_loss for k =  13 is 0.147643793967
    log_loss for k =  15 is 0.159439699615
    log_loss for k =  17 is 0.16878376444
    log_loss for k =  19 is 0.178020728839
    
    log loss for train data 0.0476773462198
    log loss for cv data 0.0958800580948
    log loss for test data 0.0894810720832
    Number of misclassified points  2.02391904324
    -------------------------------------------------- Confusion matrix --------------------------------------------------
    
    -------------------------------------------------- Precision matrix --------------------------------------------------
    
    Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    -------------------------------------------------- Recall matrix --------------------------------------------------
    
    Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    

    28. Logistic Regression ONLY on .asm file features

    Back to the top

    In [ ]:
    # read more about SGDClassifier() at http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDClassifier.html
    # ------------------------------
    # default parameters
    # SGDClassifier(loss=’hinge’, penalty=’l2’, alpha=0.0001, l1_ratio=0.15, fit_intercept=True, max_iter=None, tol=None, 
    # shuffle=True, verbose=0, epsilon=0.1, n_jobs=1, random_state=None, learning_rate=’optimal’, eta0=0.0, power_t=0.5, 
    # class_weight=None, warm_start=False, average=False, n_iter=None)
    
    # some of methods
    # fit(X, y[, coef_init, intercept_init, …])	Fit linear model with Stochastic Gradient Descent.
    # predict(X)	Predict class labels for samples in X.
    
    
    alpha = [10 ** x for x in range(-5, 4)]
    cv_log_error_array=[]
    for i in alpha:
        logisticR=LogisticRegression(penalty='l2',C=i,class_weight='balanced')
        logisticR.fit(X_train_asm,y_train_asm)
        sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=logisticR.classes_, eps=1e-15))
        
    for i in range(len(cv_log_error_array)):
        print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
    
    best_alpha = np.argmin(cv_log_error_array)
        
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    
    logisticR=LogisticRegression(penalty='l2',C=alpha[best_alpha],class_weight='balanced')
    logisticR.fit(X_train_asm,y_train_asm)
    sig_clf = CalibratedClassifierCV(logisticR, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    
    predict_y = sig_clf.predict_proba(X_train_asm)
    print ('log loss for train data',(log_loss(y_train_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
    predict_y = sig_clf.predict_proba(X_cv_asm)
    print ('log loss for cv data',(log_loss(y_cv_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
    predict_y = sig_clf.predict_proba(X_test_asm)
    print ('log loss for test data',(log_loss(y_test_asm, predict_y, labels=logisticR.classes_, eps=1e-15)))
    plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
    
    log_loss for c =  1e-05 is 1.58867274165
    log_loss for c =  0.0001 is 1.54560797884
    log_loss for c =  0.001 is 1.30137786807
    log_loss for c =  0.01 is 1.33317456931
    log_loss for c =  0.1 is 1.16705751378
    log_loss for c =  1 is 0.757667807779
    log_loss for c =  10 is 0.546533939819
    log_loss for c =  100 is 0.438414998062
    log_loss for c =  1000 is 0.424423536526
    
    log loss for train data 0.396219394701
    log loss for cv data 0.424423536526
    log loss for test data 0.415685592517
    Number of misclassified points  9.61361545538
    -------------------------------------------------- Confusion matrix --------------------------------------------------
    
    -------------------------------------------------- Precision matrix --------------------------------------------------
    
    Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    -------------------------------------------------- Recall matrix --------------------------------------------------
    
    Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    

    29. Random Forest Classifier ONLY on .asm file features

    Back to the top

    In [ ]:
    # --------------------------------
    # default parameters 
    # sklearn.ensemble.RandomForestClassifier(n_estimators=10, criterion=’gini’, max_depth=None, min_samples_split=2, 
    # min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=’auto’, max_leaf_nodes=None, min_impurity_decrease=0.0, 
    # min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None, verbose=0, warm_start=False, 
    # class_weight=None)
    
    # Some of methods of RandomForestClassifier()
    # fit(X, y, [sample_weight])	Fit the SVM model according to the given training data.
    # predict(X)	Perform classification on samples in X.
    # predict_proba (X)	Perform classification on samples in X.
    
    # some of attributes of  RandomForestClassifier()
    # feature_importances_ : array of shape = [n_features]
    # The feature importances (the higher, the more important the feature).
    
    alpha=[10,50,100,500,1000,2000,3000]
    cv_log_error_array=[]
    for i in alpha:
        r_clf=RandomForestClassifier(n_estimators=i,random_state=42,n_jobs=-1)
        r_clf.fit(X_train_asm,y_train_asm)
        sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=r_clf.classes_, eps=1e-15))
    
    for i in range(len(cv_log_error_array)):
        print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
    
    
    best_alpha = np.argmin(cv_log_error_array)
    
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    
    r_clf=RandomForestClassifier(n_estimators=alpha[best_alpha],random_state=42,n_jobs=-1)
    r_clf.fit(X_train_asm,y_train_asm)
    sig_clf = CalibratedClassifierCV(r_clf, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
    predict_y = sig_clf.predict_proba(X_train_asm)
    print ('log loss for train data',(log_loss(y_train_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
    predict_y = sig_clf.predict_proba(X_cv_asm)
    print ('log loss for cv data',(log_loss(y_cv_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
    predict_y = sig_clf.predict_proba(X_test_asm)
    print ('log loss for test data',(log_loss(y_test_asm, predict_y, labels=sig_clf.classes_, eps=1e-15)))
    plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
    
    log_loss for c =  10 is 0.0581657906023
    log_loss for c =  50 is 0.0515443148419
    log_loss for c =  100 is 0.0513084973231
    log_loss for c =  500 is 0.0499021761479
    log_loss for c =  1000 is 0.0497972474298
    log_loss for c =  2000 is 0.0497091690815
    log_loss for c =  3000 is 0.0496706817633
    
    log loss for train data 0.0116517052676
    log loss for cv data 0.0496706817633
    log loss for test data 0.0571239496453
    Number of misclassified points  1.14995400184
    -------------------------------------------------- Confusion matrix --------------------------------------------------
    
    -------------------------------------------------- Precision matrix --------------------------------------------------
    
    Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    -------------------------------------------------- Recall matrix --------------------------------------------------
    
    Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    

    30. XgBoost Classifier ONLY on .asm file features

    Back to the top

    In [ ]:
    # Training a hyper-parameter tuned Xg-Boost regressor on our train data
    
    # find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
    # -------------------------
    # default paramters
    # class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, 
    # objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, 
    # max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, 
    # scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
    
    # some of methods of RandomForestRegressor()
    # fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
    # get_params([deep])	Get parameters for this estimator.
    # predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
    # get_score(importance_type='weight') -> get the feature importance
    
    alpha=[10,50,100,500,1000,2000,3000]
    cv_log_error_array=[]
    for i in alpha:
        x_clf=XGBClassifier(n_estimators=i,nthread=-1)
        x_clf.fit(X_train_asm,y_train_asm)
        sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
        sig_clf.fit(X_train_asm, y_train_asm)
        predict_y = sig_clf.predict_proba(X_cv_asm)
        cv_log_error_array.append(log_loss(y_cv_asm, predict_y, labels=x_clf.classes_, eps=1e-15))
    
    for i in range(len(cv_log_error_array)):
        print ('log_loss for c = ',alpha[i],'is',cv_log_error_array[i])
    
    
    best_alpha = np.argmin(cv_log_error_array)
    
    fig, ax = plt.subplots()
    ax.plot(alpha, cv_log_error_array,c='g')
    for i, txt in enumerate(np.round(cv_log_error_array,3)):
        ax.annotate((alpha[i],np.round(txt,3)), (alpha[i],cv_log_error_array[i]))
    plt.grid()
    plt.title("Cross Validation Error for each alpha")
    plt.xlabel("Alpha i's")
    plt.ylabel("Error measure")
    plt.show()
    
    x_clf=XGBClassifier(n_estimators=alpha[best_alpha],nthread=-1)
    x_clf.fit(X_train_asm,y_train_asm)
    sig_clf = CalibratedClassifierCV(x_clf, method="sigmoid")
    sig_clf.fit(X_train_asm, y_train_asm)
        
    predict_y = sig_clf.predict_proba(X_train_asm)
    
    print ('For values of best alpha = ', alpha[best_alpha], "The train log loss is:",log_loss(y_train_asm, predict_y))
    predict_y = sig_clf.predict_proba(X_cv_asm)
    print('For values of best alpha = ', alpha[best_alpha], "The cross validation log loss is:",log_loss(y_cv_asm, predict_y))
    predict_y = sig_clf.predict_proba(X_test_asm)
    print('For values of best alpha = ', alpha[best_alpha], "The test log loss is:",log_loss(y_test_asm, predict_y))
    plot_confusion_matrix(y_test_asm,sig_clf.predict(X_test_asm))
    
    log_loss for c =  10 is 0.104344888454
    log_loss for c =  50 is 0.0567190635611
    log_loss for c =  100 is 0.056075038646
    log_loss for c =  500 is 0.057336051683
    log_loss for c =  1000 is 0.0571265109903
    log_loss for c =  2000 is 0.057103406781
    log_loss for c =  3000 is 0.0567993215778
    
    For values of best alpha =  100 The train log loss is: 0.0117883742574
    For values of best alpha =  100 The cross validation log loss is: 0.056075038646
    For values of best alpha =  100 The test log loss is: 0.0491647763845
    Number of misclassified points  0.873965041398
    -------------------------------------------------- Confusion matrix --------------------------------------------------
    
    -------------------------------------------------- Precision matrix --------------------------------------------------
    
    Sum of columns in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    -------------------------------------------------- Recall matrix --------------------------------------------------
    
    Sum of rows in precision matrix [ 1.  1.  1.  1.  1.  1.  1.  1.  1.]
    

    31. Xgboost Classifier with best hyperparameters ( ONLY on .asm file features )

    Back to the top

    In [ ]:
    x_clf=XGBClassifier()
    
    prams={
        'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
         'n_estimators':[100,200,500,1000,2000],
         'max_depth':[3,5,10],
        'colsample_bytree':[0.1,0.3,0.5,1],
        'subsample':[0.1,0.3,0.5,1]
    }
    random_cfl=RandomizedSearchCV(x_clf,param_distributions=prams,verbose=10,n_jobs=-1,)
    random_cfl.fit(X_train_asm,y_train_asm)
    
    Fitting 3 folds for each of 10 candidates, totalling 30 fits
    
    [Parallel(n_jobs=-1)]: Done   2 tasks      | elapsed:    8.1s
    [Parallel(n_jobs=-1)]: Done   9 tasks      | elapsed:   32.8s
    [Parallel(n_jobs=-1)]: Done  19 out of  30 | elapsed:  1.1min remaining:   39.3s
    [Parallel(n_jobs=-1)]: Done  23 out of  30 | elapsed:  1.3min remaining:   23.0s
    [Parallel(n_jobs=-1)]: Done  27 out of  30 | elapsed:  1.4min remaining:    9.2s
    [Parallel(n_jobs=-1)]: Done  30 out of  30 | elapsed:  2.3min finished
    
    Out[ ]:
    RandomizedSearchCV(cv=None, error_score='raise',
              estimator=XGBClassifier(base_score=0.5, colsample_bylevel=1, colsample_bytree=1,
           gamma=0, learning_rate=0.1, max_delta_step=0, max_depth=3,
           min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
           objective='binary:logistic', reg_alpha=0, reg_lambda=1,
           scale_pos_weight=1, seed=0, silent=True, subsample=1),
              fit_params=None, iid=True, n_iter=10, n_jobs=-1,
              param_distributions={'learning_rate': [0.01, 0.03, 0.05, 0.1, 0.15, 0.2], 'n_estimators': [100, 200, 500, 1000, 2000], 'max_depth': [3, 5, 10], 'colsample_bytree': [0.1, 0.3, 0.5, 1], 'subsample': [0.1, 0.3, 0.5, 1]},
              pre_dispatch='2*n_jobs', random_state=None, refit=True,
              return_train_score=True, scoring=None, verbose=10)
    In [ ]:
    print (random_cfl.best_params_)
    
    {'subsample': 1, 'n_estimators': 200, 'max_depth': 5, 'learning_rate': 0.15, 'colsample_bytree': 0.5}
    
    In [ ]:
    # Training a hyper-parameter tuned Xg-Boost regressor on our train data
    
    # find more about XGBClassifier function here http://xgboost.readthedocs.io/en/latest/python/python_api.html?#xgboost.XGBClassifier
    # -------------------------
    # default paramters
    # class xgboost.XGBClassifier(max_depth=3, learning_rate=0.1, n_estimators=100, silent=True, 
    # objective='binary:logistic', booster='gbtree', n_jobs=1, nthread=None, gamma=0, min_child_weight=1, 
    # max_delta_step=0, subsample=1, colsample_bytree=1, colsample_bylevel=1, reg_alpha=0, reg_lambda=1, 
    # scale_pos_weight=1, base_score=0.5, random_state=0, seed=None, missing=None, **kwargs)
    
    # some of methods of RandomForestRegressor()
    # fit(X, y, sample_weight=None, eval_set=None, eval_metric=None, early_stopping_rounds=None, verbose=True, xgb_model=None)
    # get_params([deep])	Get parameters for this estimator.
    # predict(data, output_margin=False, ntree_limit=0) : Predict with data. NOTE: This function is not thread safe.
    # get_score(importance_type='weight') -> get the feature importance
    
    x_clf=XGBClassifier(n_estimators=200,subsample=0.5,learning_rate=0.15,colsample_bytree=0.5,max_depth=3)
    x_clf.fit(X_train_asm,y_train_asm)
    c_cfl=CalibratedClassifierCV(x_clf,method='sigmoid')
    c_cfl.fit(X_train_asm,y_train_asm)
    
    predict_y = c_cfl.predict_proba(X_train_asm)
    print ('train loss',log_loss(y_train_asm, predict_y))
    predict_y = c_cfl.predict_proba(X_cv_asm)
    print ('cv loss',log_loss(y_cv_asm, predict_y))
    predict_y = c_cfl.predict_proba(X_test_asm)
    print ('test loss',log_loss(y_test_asm, predict_y))
    
    train loss 0.0102661325822
    cv loss 0.0501201796687
    test loss 0.0483908764397
    

    Till now, all the above portion of the code was the basic experimentation with the very basic features of byte and asm file. Now comes the final part of this project


    32. FINAL FEATURIZATION STEPS FOR THE FINAL XGBOOST MODEL TRAINING

    Back to the top

    In [3]:
    # separating byte files and asm files 
    # I am doing slight re-arrangement of the files for this FINAL run of Featurization and model training
    from google.colab import drive
    drive.mount('/content/gdrive')
    
    root_path = '/content/gdrive/MyDrive/Malware/Full_data/'
    # root_path = '../../LARGE_Datasets/'
    
    destination_1 = root_path+'byteFiles'
    destination_2 = root_path+'asmFiles'
    
    Mounted at /content/gdrive
    

    33. Uni-Gram Byte Feature extraction from byte files - For FINAL Model Train

    Back to the top

    This cell's code is what we have already ran earlier in the experimentation part, including below here again for the sake of completeness

    In [ ]:
    %%time
    
    # This cell's code is what we have already ran earlier in the experimentation part
    # Including here again for the sake of completeness
    # removal of addres from byte files
    # contents of .byte files
    # ----------------
    #00401000 56 8D 44 24 08 50 8B F1 E8 1C 1B 00 00 C7 06 08 
    #-------------------
    #we remove the starting address 00401000
    
    files = os.listdir(root_path+'byteFiles/')
    filenames=[]
    array=[]
    for file in tqdm(files):
        if(file.endswith("bytes")):
            file=file.split('.')[0]
            text_file = open(root_path+'byteFiles/'+file+".txt", 'w+')
            with open(root_path+'byteFiles/' + file + '.bytes', 'r') as fp:
                lines=""
                for line in fp:
                    # rstrip()=> Return a copy of the string with trailing characters removed.
                    # Once we have removed trailing characters, invoke split() to return the list of string which are separated by ","
                    # split() specifies the separator to use when splitting the string. By default any whitespace is a separator
                    a=line.rstrip().split(" ")[1:] # [1:] is equivalent to "1 to end" as we are removing 0-th element of address from byte files
                    b=' '.join(a)
                    b=b+"\n" # Python doesn't automatically add line breaks, you need to do that manually
                    text_file.write(b)
                fp.close()
                os.remove(root_path+'byteFiles/'+file+".bytes")
            text_file.close()
    
    files = os.listdir(root_path+'byteFiles/')
    filenames2=[]
    feature_matrix = np.zeros((len(files),257),dtype=int)
    k=0
    
    
    # program to convert into bag of words of bytefiles
    # this is custom-built bag of words this is unigram bag of words
    # This is a Custom Implementation of CountVectorizer as CountVectorizer will NOT suport working on such huge file system of 50GB
    # For this Uni-Gram feature creating and writing to a file named 'result.csv'
    
    byte_feature_file=open(root_path + 'result.csv','w+')
    
    byte_feature_file.write("ID,0,1,2,3,4,5,6,7,8,9,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??")
    byte_feature_file.write("\n")
    
    for file in tqdm(files):
        filenames2.append(file)
        byte_feature_file.write(file+",")
        if(file.endswith("txt")):
            with open(root_path+'byteFiles/'+file,"r") as byte_flie:
                for lines in byte_flie:
                    line=lines.rstrip().split(" ")
                    for hex_code in line:
                        if hex_code=='??':
                            feature_matrix[k][256]+=1
                        else:
                            feature_matrix[k][int(hex_code,16)]+=1
            byte_flie.close()
        for i, row in enumerate(feature_matrix[k]):
            if i!=len(feature_matrix[k])-1:
                byte_feature_file.write(str(row)+",")
            else:
                byte_feature_file.write(str(row))
        byte_feature_file.write("\n")
        
        k += 1
    
    byte_feature_file.close()
    
      0%|          | 0/10868 [00:00<?, ?it/s]
    100%|██████████| 10868/10868 [2:18:19<00:00,  1.31it/s]
    CPU times: user 2h 15min 44s, sys: 38.6 s, total: 2h 16min 22s
    Wall time: 2h 18min 19s
    
    
    
    In [ ]:
    %%time
    
    uni_gram_byte_features = pd.read_csv(root_path + "result.csv")
    
    uni_gram_byte_features['ID']  = uni_gram_byte_features['ID'].str.split('.').str[0]
    
    print('Unigram byte_featues shape ', uni_gram_byte_features.shape)
    
    uni_gram_byte_features.head(2)
    
    Unigram byte_featues shape  (10868, 258)
    CPU times: user 183 ms, sys: 28 ms, total: 211 ms
    Wall time: 523 ms
    
    Out[ ]:
    ID 0 1 2 3 4 5 6 7 8 ... f7 f8 f9 fa fb fc fd fe ff ??
    0 1ESMN0Gc6wRmC9BFPjWy 11548 5532 3238 3364 3256 3321 3159 3312 3275 ... 3126 3203 3144 3155 3296 3182 3264 3173 6820 1490076
    1 82rMDRO53qpfnIL4Hi1Y 19812 698 322 445 564 385 228 234 377 ... 868 975 329 204 273 668 240 239 2193 9008

    2 rows × 258 columns

    34. File sizes of Byte files - Feature Extraction -For FINAL Model Train

    Back to the top

    In [ ]:
    %%time
    
    # This cell's code is what we have already ran earlier in the experimentation part
    # Including here again for the sake of completeness
    Y=pd.read_csv(root_path + "trainLabels.csv")
    
    files=os.listdir(root_path + 'byteFiles')
    
    filenames=Y['Id'].tolist()
    
    class_y=Y['Class'].tolist()
    
    class_bytes=[]
    
    sizebytes=[]
    
    fnames=[]
    
    for file in tqdm(files):
        # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
        # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0, 
        # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
        # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
        statinfo=os.stat(root_path+'byteFiles/'+file)
        # split the file name at '.' and take the first part of it i.e the file name
        file=file.split('.')[0]
        if any(file == filename for filename in filenames):
            i=filenames.index(file)
            class_bytes.append(class_y[i])
            # converting into Mb's
            sizebytes.append(statinfo.st_size/(1024.0*1024.0))
            fnames.append(file)
    
    byte_feature_size=pd.DataFrame({'ID':fnames, 'size':sizebytes,'Class':class_bytes})
    
    print (byte_feature_size.head())
    
    100%|██████████| 10868/10868 [00:03<00:00, 3056.88it/s]
    
                         ID      size  Class
    0  1ESMN0Gc6wRmC9BFPjWy  6.703125      3
    1  82rMDRO53qpfnIL4Hi1Y  0.363281      8
    2  cNqPy69uQHgF3DOU14G7  6.703125      3
    3  cwQYBjsoDvAz5MNK8nCR  1.253906      1
    4  6T907yrYp4XJsGPk82Kh  6.703125      3
    CPU times: user 3.52 s, sys: 95.1 ms, total: 3.61 s
    Wall time: 3.8 s
    

    35. Creating some important Files and Folders, which I shall use later for saving Featuarized versions of .csv files

    Back to the top

    In [ ]:
    if not os.path.isdir(root_path + "featurization"):
        os.makedirs(root_path + "featurization")
    
    
    if not os.path.isdir(root_path + "featurization/featurization_final"):
        os.mkdir(root_path + "featurization/featurization_final")
    
    
    # Creating and writing to a file named "class_labels.pkl" to get class class_labels and ID from byte unigrams dataframe and save it for later use
    
    class_labels=byte_feature_size["Class"]
    
    with open(root_path+'featurization/class_labels.pkl', 'wb') as file:
        pkl.dump(class_labels, file)
    
    '''
    https://www.datacamp.com/community/tutorials/pickle-python-tutorial
    
    To open the file for writing, simply use the open() function. The first argument should be the name of your file. The second argument is 'wb'. The w means that you'll be writing to the file, and b refers to binary mode. This means that the data will be written in the form of byte objects.
    '''
    
    # Load the class class_labels for training with random forest feature selector
    
    with open(root_path+'featurization/class_labels.pkl', 'rb') as file:
        class_labels=pkl.load(file)
    

    36. Merging Unigram of Byte Files + Size of Byte Files to create uni_gram_byte_features__with_size

    Back to the top

    Understanding bi-gram conceptually

    N-grams of texts are extensively used in text mining and natural language processing tasks.

    This is the main concept; words are basic, meaningful elements with the ability to represent a different meaning when they are in a sentence. By this point, we keep in mind that sometimes word groups provide more benefits than only one word when explaining the meaning. Here is our sentence "I read a book about the history of America."

    The machine wants to get the meaning of the sentence by separating it into small pieces. How should it do that?

    1. It can regard words one by one. This is unigram; each word is a gram. "I", "read", "a", "book", "about", "the", "history", "of", "America"

    2. It can regard words two at a time. This is bigram (digram); each two adjacent words create a bigram. "I read", "read a", "a book", "book about", "about the", "the history", "history of", "of America"

    3. It can regard words three at a time. This is trigram; each three adjacent words create a trigram. "I read a", "read a book", "a book about", "book about the", "about the history", "the history of", "history of America"

    Source

    So, An n-gram is a contiguous sequence of n items from a given sample of text or speech. an n-gram of size 1 is referred to as a "unigram"; size 2 is a "bigram"; size 3 is a "trigram". When N>3 this is usually referred to as four grams or five grams and so on.

    Formula to calculate number of N-grams in a sentence.

    If X=Number of words in a given sentence, the number of n-grams for that sentence would be:

    Ngram = X - (n - 1)

    Example:

    Sentence : I want to learn Machine Learning

    Unigram: now calculate number of unigrams in sentence using formula

    here, X = 6 and N = 1 (for unigram)

    Ngram = X - (N - 1)

    Ngram = 6 - (1–1) = 6 (i.e. unigram is equal to number of words in a sentence)

    Biagram:

    here, X = 6 and N = 2 (for biagram)

    Ngramk = X - (N - 1)

    Ngramk = 6 - (2–1) = 5

    In [ ]:
    %%time
    
    uni_gram_byte_features__with_size = uni_gram_byte_features.merge(byte_feature_size, on="ID")
    
    uni_gram_byte_features__with_size.to_csv(root_path + "featurization/uni_gram_byte_features__with_size.csv", index=False)
    
    uni_gram_byte_features__with_size = normalize(uni_gram_byte_features__with_size)
    
    In [ ]:
    %%time
    
    from sklearn.feature_extraction.text import CountVectorizer
    
    bigram_tokens="00,01,02,03,04,05,06,07,08,09,0a,0b,0c,0d,0e,0f,10,11,12,13,14,15,16,17,18,19,1a,1b,1c,1d,1e,1f,20,21,22,23,24,25,26,27,28,29,2a,\
    2b,2c,2d,2e,2f,30,31,32,33,34,35,36,37,38,39,3a,3b,3c,3d,3e,3f,40,41,42,43,44,45,46,47,48,49,4a,4b,4c,4d,4e,4f,50,51,52,53,54,55,56,57,58,\
    59,5a,5b,5c,5d,5e,5f,60,61,62,63,64,65,66,67,68,69,6a,6b,6c,6d,6e,6f,70,71,72,73,74,75,76,77,78,79,7a,7b,7c,7d,7e,7f,80,81,82,83,84,85,86,\
    87,88,89,8a,8b,8c,8d,8e,8f,90,91,92,93,94,95,96,97,98,99,9a,9b,9c,9d,9e,9f,a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,aa,ab,ac,ad,ae,af,b0,b1,b2,b3,b4,b5,\
    b6,b7,b8,b9,ba,bb,bc,bd,be,bf,c0,c1,c2,c3,c4,c5,c6,c7,c8,c9,ca,cb,cc,cd,ce,cf,d0,d1,d2,d3,d4,d5,d6,d7,d8,d9,da,db,dc,dd,de,df,e0,e1,e2,e3,e4,\
    e5,e6,e7,e8,e9,ea,eb,ec,ed,ee,ef,f0,f1,f2,f3,f4,f5,f6,f7,f8,f9,fa,fb,fc,fd,fe,ff,??"
    
    bigram_tokens=bigram_tokens.split(",")
    
    # Between 00 and FF there are 256 unique values, so if we take each pair of Hexadecimal Values as one word, 
    # we are dealing with 256 unique values. 
    # Hence below Function will extract all the possible combinations of bigrams_counts
    def calculate_bigram(bigram_tokens):
        sentence=""
        vocabulary_list_for_byte_bigrams=[]
        for i in tqdm(range(len(bigram_tokens))):
            for j in range(len(bigram_tokens)):
                bigram=bigram_tokens[i]+" "+bigram_tokens[j]
                sentence=sentence+bigram+","
                vocabulary_list_for_byte_bigrams.append(bigram)
        return vocabulary_list_for_byte_bigrams
    
    vocabulary_list_for_byte_bigrams = calculate_bigram(bigram_tokens) 
    
    100%|██████████| 257/257 [00:00<00:00, 426.48it/s] 
    CPU times: user 600 ms, sys: 7.48 ms, total: 607 ms
    Wall time: 605 ms
    
    
    

    37. Bi-Gram Byte Feature extraction from byte files

    Back to the top

    In [ ]:
    %%time
    
    import scipy
    vectorizer = CountVectorizer(tokenizer=lambda x: x.split(),lowercase=False, ngram_range=(2,2),vocabulary=vocabulary_list_for_byte_bigrams) 
    
    # For Explanations on "tokenizer=lambda x: x.split()"
    # Refer - https://stackoverflow.com/a/37884104/1902852
    # Without this "??" was not getting vectorized properly
    
    file_list_byte_files=os.listdir(root_path + 'byteFiles')
    
    features=["ID"]+vectorizer.get_feature_names()
    
    byte_file_bigram_df=pd.DataFrame(columns=features)
    
    # Creating "featurization/byte_files_bigram_df.csv" and writng to it the full bi-gram data frame
    with open(root_path + "featurization/byte_files_bigram_df.csv", mode='w') as byte_file_bigram_df:
        byte_file_bigram_df.write(','.join(map(str, features)))
        byte_file_bigram_df.write('\n')
        for _, file in tqdm(enumerate(file_list_byte_files)):
            file_id=file.split(".")[0] #ID of each file
            file = open(root_path + 'byteFiles/' + file)
            corpus_byte_codes=[file.read().replace('\n', ' ').lower()] # corpus_byte_codes holds all the byte codes for a given file
            bigrams_counts = vectorizer.transform(corpus_byte_codes) # Returning a sparse vector containing all the bigram counts from the corpus_byte_codes
            
            # Update each row of our dataframe with the bigram counts of the respective file
            row = scipy.sparse.csr_matrix(bigrams_counts).toarray() 
            
            # Write a single row in the CSV file
            byte_file_bigram_df.write(','.join(map(str, [file_id]+list(row[0]))))
            
            byte_file_bigram_df.write('\n')
            
            file.close()
    
    100%|██████████| 257/257 [00:00<00:00, 421.69it/s]
    10868it [1:56:17,  1.56it/s]
    CPU times: user 1h 39min 47s, sys: 8min 22s, total: 1h 48min 10s
    Wall time: 1h 56min 20s
    
    
    

    38. Extracting the 2000 Most Important Features from Byte bigrams using SelectKBest with Chi-Square Test

    Back to the top

    In [ ]:
    %%time
    
    # Load the byte_files_bigram_df.csv file which is NOT normalized dataset for the byte file's bigrams
    # that we created in the previous cell
    X_byte_bigram_all_df = pd.read_csv(root_path + "featurization/byte_files_bigram_df.csv")
    
    X_byte_bigram_all_df.head(2)
    
    CPU times: user 12min 44s, sys: 0 ns, total: 12min 44s
    Wall time: 12min 59s
    
    Out[ ]:
    ID 00 00 00 01 00 02 00 03 00 04 00 05 00 06 00 07 00 08 ... ?? f7 ?? f8 ?? f9 ?? fa ?? fb ?? fc ?? fd ?? fe ?? ff ?? ??
    0 1ESMN0Gc6wRmC9BFPjWy 6557 33 24 73 17 58 21 41 26 ... 0 0 0 0 0 0 0 0 0 1490068
    1 82rMDRO53qpfnIL4Hi1Y 16299 63 21 124 8 10 6 6 7 ... 0 0 0 0 0 0 0 0 0 9000

    2 rows × 66050 columns

    In [ ]:
    %%time
    
    
    from sklearn.feature_selection import SelectKBest, chi2, f_regression
    
    select_kbest_object = SelectKBest(score_func=chi2, k=2000)
    # SelectKBest scores the features using a function, which is chi2 here
    # Then "removes all but the k highest scoring features"
    
    # Need to remove "ID" column, else will get below error 
    # "SelectKBest fit: ValueError: could not convert string to float"
    
    most_imp_features_byte_bigram = select_kbest_object.fit(X_byte_bigram_all_df.drop("ID", axis=1), class_labels)
    
    # most_imp_features_byte_bigram.scores_ => gives an array of form 
    # array([9.79531407e+05, 4.26642398e+04, 1.78812060e+04, ..., 4.33426736e+07])
    # So now creating a df from this array
    most_imp_byte_bigram_feature_score_df = pd.DataFrame(most_imp_features_byte_bigram.scores_)
    
    # Creating a df from all the column names from the original full X_byte_bigram_all_df df
    most_imp_byte_bigram_columns_df = pd.DataFrame(X_byte_bigram_all_df.columns)
    
    CPU times: user 15.6 s, sys: 1.52 s, total: 17.1 s
    Wall time: 1min 12s
    
    In [ ]:
    # Concat the feature scores along with the feature names in a byte_bigram_df_important_feature_score, 
    # From this we will get all feature names later, to be matched against X_byte_bigram_all_df - to extract ONLY the best features from the bigrams df data
    byte_bigram_df_important_feature_score = pd.concat([most_imp_byte_bigram_columns_df, most_imp_byte_bigram_feature_score_df],axis=1)
    
    byte_bigram_df_important_feature_score.columns = ["Byte Bigram Top 2000 Feature Names","Byte Bigram Top 2000 Feature Score"]
    
    # Find the top 2000 features along with their scores
    
    # byte_bigram_df_important_feature_score=byte_bigram_df_important_feature_score.nlargest(1000, "Byte Bigram Top 2000 Feature Score")
    
    # Return the first 2000 rows with the largest values in the specified column ( "Byte Bigram Top 2000 Feature Score" )
    # in descending order. The columns that are not specified are returned as well, but not used for ordering.
    # Let's look at the top 10 features along with their scores + Save the feature score DF
    byte_bigram_df_important_feature_score = byte_bigram_df_important_feature_score.nlargest(10, "Byte Bigram Top 2000 Feature Score")
    
    byte_bigram_df_important_feature_score.head(2)
    
    Out[ ]:
    Byte Bigram Top 2000 Feature Names Byte Bigram Top 2000 Feature Score
    33220 81 42 1.256095e+07
    17348 43 80 1.248059e+07
    In [ ]:
    # Getting the list of first 2000 feature names
    top_2000_most_imp_byte_bigram_feature_names = list(byte_bigram_df_important_feature_score["Byte Bigram Top 2000 Feature Names"])
    
    # top_2000_byte_bigram_features = dd.concat([X_byte_bigram_all_df["ID"], X_byte_bigram_all_df[top_2000_most_imp_byte_bigram]], axis=1)
    top_2000_byte_bigram_features = pd.concat([X_byte_bigram_all_df["ID"], X_byte_bigram_all_df[top_2000_most_imp_byte_bigram_feature_names]], axis=1)
    
    top_2000_byte_bigram_features.to_csv(root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv",index=None)
    
    print(top_2000_byte_bigram_features.shape)
    top_2000_byte_bigram_features.head(2)
    
    (10868, 11)
    
    Out[ ]:
    ID 81 42 43 80 2d 2c 43 59 5a 42 43 6c 6d 42 82 81 2b 42 43 2a
    0 1ESMN0Gc6wRmC9BFPjWy 10 16 14 12 12 21 12 16 16 13
    1 82rMDRO53qpfnIL4Hi1Y 0 1 1 2 2 5 4 0 2 1

    39. ASM Unigram - Top 52 Unigram Features from ASM Files - Final Model Training

    Back to the top

    There are 10868 files of asm files which make up about 150 GB

    The asm files contains :

    1. Address
    2. Segments
    3. Opcodes
    4. Registers
    5. function calls
    6. APIs

    Earlier we already extracted all the features with the help of parallel processing. Here we extracted 52 features from all the asm files which are important.

    The asmoutputfile.csv was generated after extracting the unigram features from the .asm files which was ~150GB of data. Took around 48 Hours to process, and we can directly use this file here.

    In [ ]:
    # First read the file that was generated above code
    # Meaning, the code that ran for around 48 hours as mentioned above.
    dfasm=pd.read_csv(root_path + "/asmoutputfile.csv")
    
    Y.columns = ['ID', 'Class'] 
    # Note, Y is all the Train Labels of 0 to 9 which has been defined earlier as below
    # Y = pd.read_csv(root_path + "trainLabels.csv")
    
    unigram_asm = pd.merge(dfasm, Y, on='ID', how='left')
    
    unigram_asm = normalize(unigram_asm)
    
    unigram_asm.head()
    
    Out[ ]:
    ID HEADER: .text: .Pav: .idata: .data: .bss: .rdata: .edata: .rsrc: ... edx esi eax ebx ecx edi ebp esp eip Class
    0 01kcPWA9K2BOxQeS5Rju 0.107345 0.001092 0.0 0.000761 0.000023 0.0 0.000084 0.0 0.000072 ... 0.000343 0.000746 0.000301 0.000360 0.001057 0.0 0.030797 0.001468 0.003173 1
    1 1E93CpP60RHFNiT5Qfvn 0.096045 0.001230 0.0 0.000617 0.000019 0.0 0.000000 0.0 0.000072 ... 0.000343 0.000328 0.000965 0.000686 0.000153 0.0 0.025362 0.000000 0.002188 1
    2 3ekVow2ajZHbTnBcsDfX 0.096045 0.000627 0.0 0.000300 0.000017 0.0 0.000038 0.0 0.000072 ... 0.000248 0.000475 0.000201 0.000560 0.000178 0.0 0.019928 0.000000 0.000985 1
    3 3X2nY7iQaPBIWDrAZqJe 0.096045 0.000333 0.0 0.000258 0.000008 0.0 0.000000 0.0 0.000072 ... 0.000114 0.000090 0.000281 0.000059 0.000025 0.0 0.014493 0.000000 0.000657 1
    4 46OZzdsSKDCFV8h7XWxf 0.096045 0.000590 0.0 0.000353 0.000068 0.0 0.000000 0.0 0.000072 ... 0.000229 0.000102 0.000362 0.000243 0.000064 0.0 0.019928 0.000000 0.001204 1

    5 rows × 53 columns

    40. File Size of ASM Files - Feature Extraction - Final Model Training

    Back to the top

    This cell's code is what we have already ran earlier in the experimentation part, including below here again for the sake of completeness

    In [ ]:
    # file sizes of byte files
    # This code is very much similar to what has been used to extract sizes of 
    # byte files earlier.
    files=os.listdir(root_path + 'asmFiles')
    
    filenames=Y['ID'].tolist()
    
    class_y=Y['Class'].tolist()
    
    class_bytes=[]
    
    sizebytes=[]
    
    fnames=[]
    
    for file in tqdm(files):
        # print(os.stat('byteFiles/0A32eTdBKayjCWhZqDOQ.txt'))
        # os.stat_result(st_mode=33206, st_ino=1125899906874507, st_dev=3561571700, st_nlink=1, st_uid=0, st_gid=0, 
        # st_size=3680109, st_atime=1519638522, st_mtime=1519638522, st_ctime=1519638522)
        # read more about os.stat: here https://www.tutorialspoint.com/python/os_stat.htm
        statinfo=os.stat(root_path + 'asmFiles/'+file)
        # split the file name at '.' and take the first part of it i.e the file name
        file=file.split('.')[0]
        if any(file == filename for filename in filenames):
            i=filenames.index(file)
            class_bytes.append(class_y[i])
            # converting into Mb's
            sizebytes.append(statinfo.st_size/(1024.0*1024.0))
            fnames.append(file)
    
    asm_file_size=pd.DataFrame({'ID':fnames,'size':sizebytes,'Class':class_bytes})
    
    # asm_file_size.to_csv(root_path + "featurization/asm_file_size.csv", index=False)
    
    asm_file_size.head()
    
     97%|█████████▋| 10514/10868 [00:03<00:00, 3169.50it/s]
    
    Out[ ]:
    ID size Class
    0 asoPA4pgUtHm0dQzn9Tb 0.452068 1
    1 gCBJQKMq14Atfe3ZSRX9 0.179508 3
    2 ExVywGIrOR8UiNtujSXh 1.744554 6
    3 KgiNFPOsZn08u9BEhWyx 6.686383 7
    4 29NR1zBEDCPM5xntsdlA 4.044003 1

    41. Merging ASM Unigram + ASM File Size

    Back to the top

    In [ ]:
    unigram_asm_feature__with_size=pd.merge(asm_file_size, unigram_asm.drop(columns=["Class"]),on='ID', how='left')
    
    unigram_asm_feature__with_size.to_csv(root_path + "featurization/unigram_asm_feature__with_size")
    
    unigram_asm_feature__with_size.head()
    
    Out[ ]:
    ID size Class HEADER: .text: .Pav: .idata: .data: .bss: .rdata: ... :dword edx esi eax ebx ecx edi ebp esp eip
    0 asoPA4pgUtHm0dQzn9Tb 0.452068 1 0.107345 0.005452 0.000000 0.001546 0.000842 0.0 0.00088 ... 0.005370 0.002061 0.004045 0.005607 0.003396 0.001478 0.0 0.038043 0.000000 0.007659
    1 gCBJQKMq14Atfe3ZSRX9 0.179508 3 0.096045 0.002121 0.000000 0.000000 0.000419 0.0 0.00039 ... 0.000915 0.000725 0.000814 0.003015 0.000995 0.001847 0.0 0.007246 0.000000 0.000000
    2 ExVywGIrOR8UiNtujSXh 1.744554 6 0.096045 0.000000 0.000000 0.001486 0.000000 0.0 0.00000 ... 0.002355 0.003015 0.002339 0.003095 0.005094 0.001045 0.0 0.016304 0.000000 0.010833
    3 KgiNFPOsZn08u9BEhWyx 6.686383 7 0.096045 0.059783 0.852852 0.045407 0.000000 0.0 0.00000 ... 0.002340 0.004713 0.003378 0.004341 0.004207 0.000153 0.0 0.000000 0.004556 0.008754
    4 29NR1zBEDCPM5xntsdlA 4.044003 1 0.107345 0.090950 0.000000 0.005302 0.002296 0.0 0.00000 ... 0.081379 0.044784 0.065354 0.081915 0.050456 0.024077 0.0 0.043478 0.000000 0.022650

    5 rows × 54 columns

    42. ASM Files - Convert the ASM files to images.

    Back to the top

    In [ ]:
    %%time
    
    import numpy as np
    import os
    import codecs
    import imageio
    import array
    from datetime import datetime as dt
    
    if not os.path.isdir(root_path + "image_file_asm"):
        os.mkdir(root_path + "image_file_asm")
    
    asmfile_list=os.listdir(root_path + "asmFiles/")
    
    # Function to extract images from ASM files and save them to a specified folder (the second arg to the func)
    def extract_images_from_text(arr_of_filenames, folder_to_save_generated_images):  
        for file_name in tqdm(arr_of_filenames):
            
            if(file_name.endswith("asm")):
                this_file = codecs.open(root_path + "asmFiles/" + file_name, 'rb')
                size_of_current_asm_file = os.path.getsize(root_path + "asmFiles/"+file_name)        
            
            width_of_file = int(size_of_current_asm_file**0.5)
            
            remainder = size_of_current_asm_file % width_of_file
            
            # To create array of single bytes, passing type code 'B'
            # "B" is for unsigned characters
            array_of_image = array.array('B')
            
            array_of_image.fromfile(this_file, size_of_current_asm_file-remainder)
            
            this_file.close()
            
            arr_of_generated_image = np.reshape(array_of_image[:width_of_file * width_of_file], (width_of_file, width_of_file))
            
            arr_of_generated_image = np.uint8(arr_of_generated_image)
            
            imageio.imwrite(folder_to_save_generated_images+'/' + file_name.split(".")[0] + '.png', arr_of_generated_image)
            
            
    # Now invoke the above function
    
    directory_to_save_generated_image = root_path + 'image_file_asm'
    
    extract_images_from_text(asmfile_list, directory_to_save_generated_image)
    
    100%|██████████| 10868/10868 [2:21:29<00:00,  1.28it/s]
    CPU times: user 2h 3min 31s, sys: 24.8 s, total: 2h 3min 56s
    Wall time: 2h 21min 29s
    
    
    

    43. Extract the first 800 pixel data from ASM File Images

    Back to the top

    Load each ASM image > Convert it into a numpy array > Take the first 800 pixels from each image.

    In [ ]:
    file_list_asm_files=os.listdir(root_path + 'image_file_asm/')
    
    with open(root_path + "featurization/top_800_image_asm_df.csv", mode='w') as top_800_image_asm_df: #file_list_asm_files = 10868, top_800_image_asm_df=800
        # top_800_image_asm_df.write(','.join(map(str, ["ID"]+["pixel_asm{}".format(i) for i in range(800)])))
        top_800_image_asm_df.write(','.join(map(str, ["ID"]+["pixel_asm{}".format(i) for i in range(10)])))
        top_800_image_asm_df.write('\n')
        
        for image in tqdm(file_list_asm_files):
            file_id_asm_files=image.split(".")[0]
            
             # Create a 2 Matrix to contain the image matrix in 2D format
            asm_image_array=imageio.imread(root_path + "image_file_asm/"+image)
            
            # Extracting from flattened array the first 800 pixels 
            # asm_image_array=asm_image_array.flatten()[:800]
            asm_image_array=asm_image_array.flatten()[:10]
            top_800_image_asm_df.write(','.join(map(str, [file_id_asm_files]+list(asm_image_array))))
            top_800_image_asm_df.write('\n')
    
    100%|█████████▉| 10866/10868 [12:21<00:00,  7.70it/s]
    
    In [ ]:
    %%time
    
    top_800_image_asm_df=pd.read_csv(root_path + "featurization/top_800_image_asm_df.csv")
    top_800_image_asm_df.head()
    
    CPU times: user 12.2 ms, sys: 79 µs, total: 12.3 ms
    Wall time: 11.3 ms
    
    Out[ ]:
    ID pixel_asm0 pixel_asm1 pixel_asm2 pixel_asm3 pixel_asm4 pixel_asm5 pixel_asm6 pixel_asm7 pixel_asm8 pixel_asm9
    0 amGeXDTwCldUsVBHPqkr 72 69 65 68 69 82 58 48 48 52
    1 Izbos3ZTyWKju6k514NY 46 122 101 110 99 58 48 48 52 48
    2 JhUKcAYwftEjC1PuDgIe 72 69 65 68 69 82 58 48 48 52
    3 dcBsUgJNZa9tHkVISRXQ 72 69 65 68 69 82 58 48 48 52
    4 01jsnpXSAlgw6aPeDxrU 72 69 65 68 69 82 58 48 48 52

    44. Extracting Opcodes Bigrams from ASM Files

    Back to the top

    We know that the asm files contain assembly language code which comprises keywords, opcodes, registers, APIs.

    In [ ]:
    %%time
    
    opcodes_for_bigram = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
    
    # Converting list to dictionary for faster runtime
    dict_asm_opcodes = dict(zip(opcodes_for_bigram, [1 for i in range(len(opcodes_for_bigram))]))
    
    if not os.path.isdir(root_path + "opcodes_asm_files"):
        os.mkdir(root_path + 'opcodes_asm_files')
    
    '''
    
    Noting first that the asm files contains :
    
    1. Address
    2. Segments
    3. Opcodes
    4. Registers
    5. function calls
    6. APIs
    
    Calculating opcode sequences for each asm file and save in form of a text file, so that we can process the ASM files as text files
    
    In that text file, each row corresponds to respective file. 
    
    Noting, in asm files the opcodes_for_bigram were not placed side by side, instead there are few words between two opcodes. i.e. The Opcodes occurs with an interval.
    
    So during extraction of opcodes_for_bigram we need to preserve the sequence information. 
    
    e.g. which opcode prcede another opcode or which opcode is followed is followed by another opcode.
    
    Based on this, a bigram data-matrix of vectors is to be derived containing the bigram sequence info on each file.
    
    '''
    
    def calculate_sequence_of_opcodes():
        asm_file_names=os.listdir(root_path + 'asmFiles')
        for this_asm_file in tqdm(asm_file_names):
            each_asm_opcode_file = open(root_path + "opcodes_asm_files/{}_opcode_asm_bi_grams.txt".format(this_asm_file.split('.')[0]), "w+")
            sequence_of_opcodes = ""
            with codecs.open(root_path + 'asmFiles/' + this_asm_file, encoding='cp1252', errors ='replace') as asm_file:
                for lines in asm_file:
                    
                    line = lines.rstrip().split()            
                    
                    for word in line:
                        if dict_asm_opcodes.get(word)==1:
                            sequence_of_opcodes += word + ' '
            each_asm_opcode_file.write(sequence_of_opcodes + "\n")
            each_asm_opcode_file.close()
        
    calculate_sequence_of_opcodes()
    
    opcodes_asm__bigram_vocabulary = calculate_bigram(opcodes_for_bigram)
    
    100%|██████████| 10868/10868 [2:55:58<00:00,  1.03it/s] 
    100%|██████████| 26/26 [00:00<00:00, 72267.66it/s]
    CPU times: user 2h 50min 52s, sys: 2min 4s, total: 2h 52min 56s
    Wall time: 2h 55min 58s
    
    
    

    45. Calcualte opcodes bigram with above defined function and make them a feature and then save the data matrix of feature as a .csv file

    Back to the top

    In [ ]:
    vectorizer_opcode = CountVectorizer(
        tokenizer=lambda x: x.split(),
        lowercase=False,
        ngram_range=(2, 2),
        vocabulary=opcodes_asm__bigram_vocabulary,
    )  # Noting, without "tokenizer=lambda x: x.split()", "??" would not get vectorized correctly
    
    file_list_opcode = os.listdir(root_path + "opcodes_asm_files")
    
    opcode_features = ["ID"] + vectorizer_opcode.get_feature_names()
    
    opcodes_asm_bigram_df = pd.DataFrame(columns=opcode_features)
    
    with open(
        root_path + "featurization/opcodes_asm_bigram_df.csv", mode="w"
    ) as opcodes_asm_bigram_df:
    
        opcodes_asm_bigram_df.write(",".join(map(str, opcode_features)))
    
        opcodes_asm_bigram_df.write("\n")
    
        for _, this_asm_file in tqdm(enumerate(file_list_opcode)):
    
            this_file_id = this_asm_file.split("_")[0]  # ID of each this_asm_file
    
            this_asm_file = open(root_path + "opcodes_asm_files/" + this_asm_file)
    
            corpus_opcodes_from_this_asm_file = [
                this_asm_file.read().replace("\n", " ").lower()
            ]  # Variable to hold all opcodes for a given this_asm_file
    
            bigrams_opcodes_asm = vectorizer_opcode.transform(
                corpus_opcodes_from_this_asm_file
            )  # Returning a sparse vector holding all bigram counts from corpus_opcodes_from_this_asm_file
    
            # Update each row of the dataframe with the bigram counts of the respective this_asm_file
            # And return a dense ndarray representation of this matrix. Because,
            # CountVectorizer produces a sparse representation of the counts using scipy.sparse.csr_matrix
            row = scipy.sparse.csr_matrix(bigrams_opcodes_asm).toarray()
    
            opcodes_asm_bigram_df.write(
                ",".join(map(str, [this_file_id] + list(row[0])))
            )  # Write a single row in the CSV this_asm_file
    
            opcodes_asm_bigram_df.write("\n")
    
            this_asm_file.close()
    
    
    opcodes_asm_bigram_df = pd.read_csv(
        root_path + "featurization/opcodes_asm_bigram_df.csv"
    )
    
    opcodes_asm_bigram_df.head()
    
    10868it [02:17, 78.86it/s] 
    
    Out[ ]:
    ID jmp jmp jmp mov jmp retf jmp push jmp pop jmp xor jmp retn jmp nop jmp sub ... movzx cmp movzx call movzx shl movzx ror movzx rol movzx jnb movzx jz movzx rtn movzx lea movzx movzx
    0 hdGzSti0HYqTLaZ4lrmv 0 74 0 79 0 20 0 0 4 ... 3 0 1 0 0 0 0 0 1 3
    1 BjaLF1KHchGQlUrYO6fn 0 13 0 7 0 5 0 0 1 ... 0 0 0 0 0 0 0 0 0 0
    2 F7MWGPqic0rbdlKCImEk 0 63 0 39 0 19 0 0 1 ... 4 0 1 0 0 2 3 0 7 8
    3 8cqDWHrnyKRC2Ja9biQ5 380 62 0 24 0 11 1 0 26 ... 0 0 0 0 0 0 0 0 0 0
    4 5RUlGkJf6XHI3M2L18tx 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 677 columns

    46. ASM File - Top Important 500 features from Opcodes Bigrams

    Back to the top

    In [ ]:
    X_opcode_asm_bigram = opcodes_asm_bigram_df
    y = class_labels
    # X_opcode_asm_bigram.head()
    
    #Get the best 500 features using SelectKBest. 
    
    
    kbest_object = SelectKBest(score_func=chi2, k=500)
    
    top_features=kbest_object.fit(X_opcode_asm_bigram.drop("ID", axis=1), y)
    
    # Save a dataframe with the feature scores along with the feature names.
    # And we will get the best fetures from this dataframe use to 
    top_features_scores=pd.DataFrame(top_features.scores_)
    
    # Now to get the original features names i.e. the names of all the columns we will need
    # `X_opcode_asm_bigram.columns`
    X_opcode_columns=pd.DataFrame(X_opcode_asm_bigram.columns)
    
    # Now concat all  original features names as a column with another column
    # which is "top_features_scores"
    top_asm_opcode_bigram_df=pd.concat([X_opcode_columns, top_features_scores],axis=1)
    
    # Give 2 Names for these 2 columns of data for this newly creaetd dataframe
    top_asm_opcode_bigram_df.columns=["ASM_Opcode_Bigram_Top_Feature_Name","ASM_Opcode_Bigram_Top_Feature_Score"]
    
    # Extract the largest 500 from this dataframw based on the values of "top_features_scores"
    top_asm_opcode_bigram_df=top_asm_opcode_bigram_df.nlargest(500,"ASM_Opcode_Bigram_Top_Feature_Score")
    
    top_asm_opcode_bigram_df.head()
    
    Out[ ]:
    ASM_Opcode_Bigram_Top_Feature_Name ASM_Opcode_Bigram_Top_Feature_Score
    189 nop retn 218957.002515
    27 mov jmp 185301.310475
    319 imul retn 119795.693264
    183 nop jmp 116970.779035
    33 mov retn 106036.092701
    In [ ]:
    top_500_asm_bigram_features=list(top_asm_opcode_bigram_df["ASM_Opcode_Bigram_Top_Feature_Name"])
    
    top_500_asm_bigram_df=pd.concat([X_opcode_asm_bigram["ID"], X_opcode_asm_bigram[top_500_asm_bigram_features]], axis=1)
    
    # The "ID" column was being duplicated, hence need to remove that, and also the possibility of any other duplicated column
    top_500_asm_bigram_df = top_500_asm_bigram_df.loc[:,~top_500_asm_bigram_df.columns.duplicated()]
    
    top_500_asm_bigram_df.to_csv(root_path + "featurization/featurization_final/top_500_asm_opcodes_bigram_df.csv",index=None)
    
    top_500_asm_bigram_df.head()
    
    Out[ ]:
    ID nop retn mov jmp imul retn nop jmp mov retn nop add xor retn nop pop push retf ... imul rol retn ror retf add cmp inc shr shl rol xor xchg lea cmp call nop lea add movzx
    0 hdGzSti0HYqTLaZ4lrmv 0 82 0 0 79 0 21 0 0 ... 0 0 0 1 0 0 0 2 0 0
    1 BjaLF1KHchGQlUrYO6fn 0 20 0 0 8 1 1 0 0 ... 0 0 0 3 0 0 0 1 0 0
    2 F7MWGPqic0rbdlKCImEk 0 88 0 0 6 0 3 0 0 ... 0 0 0 26 1 0 0 4 0 31
    3 8cqDWHrnyKRC2Ja9biQ5 0 86 0 0 65 0 4 0 0 ... 0 0 0 1 0 0 0 7 0 0
    4 5RUlGkJf6XHI3M2L18tx 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 500 columns

    47. Opcodes Trigrams ASM Files - Feature extraction

    Back to the top

    In [ ]:
    # Function to return all possible n*n*n combinations of trigrams
    def calculate_trigram(tokens):
        sent = ""
        trigram_result = []
        for i in range(len(tokens)):
            for j in range(len(tokens)):
                for k in range(len(tokens)):
                    trigram = tokens[i] + " " + tokens[j] + " " + tokens[k]
                    trigram_result.append(trigram)
        return trigram_result
      
    
    # test_tokens=['edx','esi','eax']
    # trigram_result = calculate_trigram(test_tokens)
    # trigram_result
    
    In [ ]:
    opcodes_trigram = ['jmp', 'mov', 'retf', 'push', 'pop', 'xor', 'retn', 'nop', 'sub', 'inc', 'dec', 'add','imul', 'xchg', 'or', 'shr', 'cmp', 'call', 'shl', 'ror', 'rol', 'jnb','jz','rtn','lea','movzx']
    
    opcodes_trigram_asm_vocabulary = calculate_trigram(
        opcodes_trigram
    )  # Holding all n*n*n possible combinations of trigrams_from_asm_files
    
    vectorizer = CountVectorizer(
        tokenizer=lambda x: x.split(),
        lowercase=False,
        ngram_range=(3, 3),
        vocabulary=opcodes_trigram_asm_vocabulary,
    )  # NOTE: without "tokenizer=lambda x: x.split()", "??" would not get vectorized properly
    
    file_lists_asm_opcodes = os.listdir(root_path + "opcodes_asm_files")
    
    features = ["ID"] + vectorizer.get_feature_names()
    
    opcodes_asm_trigram_df = pd.DataFrame(columns=features)
    
    with open(
        root_path + "featurization/opcodes_asm_trigram_df.csv", mode="w"
    ) as opcodes_asm_trigram_df:
        
        opcodes_asm_trigram_df.write(",".join(map(str, features)))
        
        opcodes_asm_trigram_df.write("\n")
        
        for _, current_asm_textized_file in tqdm(enumerate(file_lists_asm_opcodes)):
            each_file_id = current_asm_textized_file.split("_")[0]
            current_asm_textized_file = open(
                root_path + "opcodes_asm_files/" + current_asm_textized_file
            )
            corpus_for_asm_files_opcodes = [
                current_asm_textized_file.read().replace("\n", " ").lower()
            ]  # This will contain all the opcodes_trigram codes for a given current_asm_textized_file
    
            # CountVectorizer produces a sparse representation of the counts using scipy.sparse.csr_matrix.
            # Hence below is a sparse vector of all trigram counts from corpus_for_asm_files_opcodes
            trigrams_from_asm_files = vectorizer.transform(corpus_for_asm_files_opcodes)
    
            # So now return a dense ndarray representation of this matrix
            # Updating each row_trigram_count of the dataframe with trigram counts
            # of corresponding current_asm_textized_file
            row_trigram_count = scipy.sparse.csr_matrix(trigrams_from_asm_files).toarray()
    
            # Write that single row in the CSV for current_asm_textized_file
            opcodes_asm_trigram_df.write(
                ",".join(map(str, [each_file_id] + list(row_trigram_count[0])))
            )
    
            opcodes_asm_trigram_df.write("\n")
    
            current_asm_textized_file.close()
    
    
    opcodes_asm_trigram_df = pd.read_csv(
        root_path + "featurization/opcodes_asm_trigram_df.csv"
    )
    opcodes_asm_trigram_df.head()
    
    10868it [02:08, 84.59it/s] 
    
    Out[ ]:
    ID jmp jmp jmp jmp jmp mov jmp jmp retf jmp jmp push jmp jmp pop jmp jmp xor jmp jmp retn jmp jmp nop jmp jmp sub ... movzx movzx cmp movzx movzx call movzx movzx shl movzx movzx ror movzx movzx rol movzx movzx jnb movzx movzx jz movzx movzx rtn movzx movzx lea movzx movzx movzx
    0 hdGzSti0HYqTLaZ4lrmv 0 0 0 0 0 0 0 0 0 ... 1 0 0 0 0 0 0 0 0 1
    1 BjaLF1KHchGQlUrYO6fn 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    2 F7MWGPqic0rbdlKCImEk 0 0 0 0 0 0 0 0 0 ... 2 0 0 0 0 0 0 0 2 0
    3 8cqDWHrnyKRC2Ja9biQ5 377 0 0 0 0 2 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    4 5RUlGkJf6XHI3M2L18tx 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 17577 columns

    48. ASM File - Top Important 800 features from Opcodes Trigrams

    Back to the top

    This will be the same sequence of steps what we applied earlier for extracting top 500 Features from ASM bigrams.

    In [ ]:
    %%time 
    
    X_opcode_asm_trigram = opcodes_asm_trigram_df
    y = class_labels
    # X_opcode_asm_trigram.head()
    
    #Get the best 500 features using SelectKBest. Save the feature scores along with the feature names in a feature_score_df_df, which we will use to get the best fetures from the bigrams df data
    
    kbest_object = SelectKBest(score_func=chi2, k=800)
    
    top_features=kbest_object.fit(X_opcode_asm_trigram.drop("ID", axis=1), y)
    
    top_features_scores=pd.DataFrame(top_features.scores_)
    
    X_opcode_columns=pd.DataFrame(X_opcode_asm_trigram.columns)
    
    top_asm_opcode_trigram_df=pd.concat([X_opcode_columns,top_features_scores],axis=1)
    
    top_asm_opcode_trigram_df.columns=["ASM_Opcode_Top_Feature_Name","ASM_Opcode_Top_Feature_Score"]
    
    top_asm_opcode_trigram_df=top_asm_opcode_trigram_df.nlargest(800,"ASM_Opcode_Top_Feature_Score")
    
    top_asm_opcode_trigram_df.head()
    
    CPU times: user 3.64 s, sys: 176 ms, total: 3.81 s
    Wall time: 3.81 s
    
    Out[ ]:
    ASM_Opcode_Top_Feature_Name ASM_Opcode_Top_Feature_Score
    703 mov mov jmp 136153.690581
    8301 imul nop retn 82471.139719
    4921 nop nop retn 75176.223576
    4915 nop nop jmp 60993.143342
    4765 nop mov retn 56078.808690
    In [ ]:
    %%time
    
    # Get List of the 800 top features
    top_800_asm_trigram_features=list(top_asm_opcode_trigram_df["ASM_Opcode_Top_Feature_Name"])
    
    top_800_asm_trigam_df=pd.concat([X_opcode_asm_trigram["ID"], X_opcode_asm_trigram[top_800_asm_trigram_features]], axis=1)
    
    # The "ID" column was being duplicated, hence need to remove that, and also the possibility of any other duplicated column
    top_800_asm_trigam_df = top_800_asm_trigam_df.loc[:,~top_800_asm_trigam_df.columns.duplicated()]
    
    top_800_asm_trigam_df.to_csv(root_path + "featurization/featurization_final/top_800_asm_opcodes_trigram_df.csv",index=None)
    
    top_800_asm_trigam_df.head()
    
    CPU times: user 916 ms, sys: 16 ms, total: 932 ms
    Wall time: 1.08 s
    
    Out[ ]:
    ID mov mov jmp imul nop retn nop nop retn nop nop jmp nop mov retn nop imul retn nop nop add jnb jnb rol mov nop retn ... add jmp shr cmp mov retf jz jz lea mov dec movzx mov shr xchg add jmp movzx xor mov or mov dec sub add mov dec mov cmp shr
    0 hdGzSti0HYqTLaZ4lrmv 45 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0
    1 BjaLF1KHchGQlUrYO6fn 7 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 1 0
    2 F7MWGPqic0rbdlKCImEk 35 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 2 0 0 10 1
    3 8cqDWHrnyKRC2Ja9biQ5 12 0 0 0 0 0 0 0 0 ... 0 0 1 0 0 0 1 1 0 0
    4 5RUlGkJf6XHI3M2L18tx 0 0 0 0 0 0 0 0 0 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 800 columns

    49. Final Merging of all Features for the Final XGBOOST Training

    Back to the top

    - Unigram of Byte Files + Size of Byte Files +

    - Top 52 Unigram of ASM Files + Size of ASM Files

    - Top 2000 Bi-Gram of Byte files +

    - Top 500 Bigram of Opcodes of ASM Files

    - Top 800 Trigram of Opcodes of ASM Files

    - Top 800 ASM Image Features

    In [ ]:
    %%time
    
    # Unigram of Byte Files + Size of Byte Files + 
    uni_gram_byte_features__with_size = pd.read_csv(
        root_path + "featurization/uni_gram_byte_features__with_size.csv"
    )
    
    # Top 52 Unigram of ASM Files  + Size of ASM Files
    # Droping .BSS, .rtn, .CODE features from the unigram_asm_feature__with_size (which is the unigram of asm files) dataset
    # As we earlier saw that these features were not much important in separating class labels
    unigram_asm_feature__with_size = pd.read_csv(
        root_path + "featurization/unigram_asm_feature__with_size"
    ).drop(["Class", "rtn", ".BSS:", ".CODE"], axis=1)
    
    # Top 2000 Bi-Gram of Byte files
    # top_2000_imp_byte_bigram_df = pd.read_csv(
    #     root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv"
    # ).drop(columns=["ID.1"])
    
    top_2000_imp_byte_bigram_df = pd.read_csv(
        root_path + "featurization/featurization_final/top_2000_imp_byte_bigram_df.csv"
    )
    
    # Top 500 Bigram of Opcodes of ASM Files
    top_500_asm_bigram_df = pd.read_csv(root_path + "featurization/featurization_final/top_500_asm_opcodes_bigram_df.csv")
    
    
    # Top 800 Trigram of Opcodes of ASM Files
    top_800_asm_trigam_df = pd.read_csv(root_path + "featurization/featurization_final/top_800_asm_opcodes_trigram_df.csv")
    
    # Top 800 ASM Image Features
    top_800_image_asm_df = pd.read_csv(root_path + "featurization/top_800_image_asm_df.csv")
    
    CPU times: user 883 ms, sys: 0 ns, total: 883 ms
    Wall time: 897 ms
    
    In [ ]:
    %%time
    
    # Initiate a dataframe for representing the Combined Features
    # and set it equal to uni_gram_byte_features__with_size
    combined_features_final_df = uni_gram_byte_features__with_size
    
    individual_featuarized_dfs = [
        unigram_asm_feature__with_size,
        top_800_image_asm_df,
        top_2000_imp_byte_bigram_df,
        top_500_asm_bigram_df,
        top_800_asm_trigam_df
    ]
    
    for df in tqdm(individual_featuarized_dfs):
        # combined_features_final_df = pd.merge(combined_features_final_df, df, on="ID", how="left")
        combined_features_final_df = pd.merge(combined_features_final_df, df, on="ID")
    
    combined_features_final_df.to_csv(
        root_path + "featurization/featurization_final/combined_features_final_df.csv",
        index=None,
    )
    
    combined_features_final_df.head()
    
    100%|██████████| 5/5 [00:00<00:00, 33.92it/s]
    
    CPU times: user 2.9 s, sys: 79.8 ms, total: 2.98 s
    Wall time: 3.41 s
    
    Out[ ]:
    ID 0 1 2 3 4 5 6 7 8 ... add jmp shr cmp mov retf jz jz lea mov dec movzx mov shr xchg add jmp movzx xor mov or mov dec sub add mov dec mov cmp shr
    0 1ESMN0Gc6wRmC9BFPjWy 11548 5532 3238 3364 3256 3321 3159 3312 3275 ... 0 0 0 0 0 0 0 0 0 0
    1 82rMDRO53qpfnIL4Hi1Y 19812 698 322 445 564 385 228 234 377 ... 0 0 0 0 0 0 0 0 0 0
    2 cNqPy69uQHgF3DOU14G7 11488 5376 3182 3248 3376 3171 3330 3209 3332 ... 0 0 0 0 0 0 0 0 0 0
    3 cwQYBjsoDvAz5MNK8nCR 90551 5907 2520 6791 5193 1356 1485 1339 4178 ... 0 0 2 0 0 5 5 1 6 0
    4 6T907yrYp4XJsGPk82Kh 11443 5636 3167 3375 3374 3347 3247 3303 3273 ... 0 0 0 0 0 0 0 0 0 0

    5 rows × 1628 columns

    50. Final Train Test Split. 64% Train, 16% Cross Validation, 20% Test

    Back to the top

    In [ ]:
    combined_features_final_df = pd.read_csv(root_path + "featurization/featurization_final/combined_features_final_df.csv")
    
    combined_features_final_df_normalized = normalize(combined_features_final_df)
    
    combined_features_final_df_normalized.to_csv(root_path + "featurization/featurization_final/combined_features_final_df_normalized.csv", index=None)
    
    In [9]:
    %%time
    
    final_X = pd.read_csv(root_path + "featurization/featurization_final/combined_features_final_df_normalized.csv").fillna(0).drop(['ID'], axis=1)
    
    final_y = pd.read_csv(root_path + "featurization/featurization_final/combined_features_final_df_normalized.csv")["Class"]
    
    
    # Splitting - Keep same distribution of class label 'y_true' with [stratify=final_y]
    X_train, X_test_final_merged, y_train, y_test_final_merged = train_test_split(final_X, final_y, stratify=final_y, test_size=0.20, random_state=42)
    
    X_train_final_merged, X_cv_final_merged, y_train_final_merged, y_cv_final_merged = train_test_split(X_train, y_train, stratify=y_train, test_size=0.20, random_state=42)
    
    print('Shape of X_train_final_merged and y_train_final_merged: ', X_train_final_merged.shape, y_train_final_merged.shape)
    
    print('Shape of X_test_final_merged and y_test_final_merged: ', X_test_final_merged.shape, y_test_final_merged.shape)
    
    print('Shape of X_cv_final_merged and y_cv_final_merged ', X_cv_final_merged.shape, y_cv_final_merged.shape)
    
    Shape of X_train_final_merged and y_train_final_merged:  (6955, 1627) (6955,)
    Shape of X_test_final_merged and y_test_final_merged:  (2174, 1627) (2174,)
    Shape of X_cv_final_merged and y_cv_final_merged  (1739, 1627) (1739,)
    CPU times: user 4.52 s, sys: 210 ms, total: 4.73 s
    Wall time: 4.88 s
    

    51. Final XGBoost Training - Hyperparameter tuning with on Final Merged Data-Matrix

    Back to the top

    In [10]:
    %%time
    
    xgb_clf=XGBClassifier()
    
    prams={
        'learning_rate':[0.01,0.03,0.05,0.1,0.15,0.2],
         'n_estimators':[100,200,500,1000,2000],
         'max_depth':[3,5,10],
        'colsample_bytree':[0.1,0.3,0.5,1],
        'subsample':[0.1,0.3,0.5,1],
        'tree_method':['gpu_hist']
    }
    
    random_clf=RandomizedSearchCV(xgb_clf, param_distributions=prams, verbose=10, n_jobs=-1)
    
    random_clf.fit(X_train_final_merged, y_train_final_merged)
    
    print(random_clf.best_params_)
    
    Fitting 5 folds for each of 10 candidates, totalling 50 fits
    
    [Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
    [Parallel(n_jobs=-1)]: Done   5 tasks      | elapsed:   28.7s
    [Parallel(n_jobs=-1)]: Done  10 tasks      | elapsed:  1.0min
    [Parallel(n_jobs=-1)]: Done  17 tasks      | elapsed:  4.7min
    [Parallel(n_jobs=-1)]: Done  24 tasks      | elapsed:  8.1min
    [Parallel(n_jobs=-1)]: Done  33 tasks      | elapsed:  9.1min
    [Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed: 12.0min
    [Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed: 14.3min finished
    
    {'tree_method': 'gpu_hist', 'subsample': 0.5, 'n_estimators': 1000, 'max_depth': 10, 'learning_rate': 0.03, 'colsample_bytree': 1}
    CPU times: user 27.4 s, sys: 1.04 s, total: 28.4 s
    Wall time: 14min 40s
    

    Best Params we got from above RandomizedSearchCV

    52. Final running of XGBoost with the Best HyperParams that we got from above RandomizedSearchCV

    Back to the top

    In [14]:
    %%time
    
    n_estimators = random_clf.best_params_['n_estimators']
    subsample = random_clf.best_params_['subsample']
    max_depth = random_clf.best_params_['max_depth']
    learning_rate = random_clf.best_params_['learning_rate']
    colsample_bytree = random_clf.best_params_['colsample_bytree']
    tree_method = random_clf.best_params_['tree_method']
    
    # print(tree_method)
    
    x_clf_with_best_hyper_param=XGBClassifier(n_estimators=n_estimators, max_depth=max_depth, learning_rate= learning_rate, colsample_bytree=colsample_bytree, subsample=subsample, tree_method=tree_method, nthread=-1)
    
    x_clf_with_best_hyper_param.fit(X_train_final_merged, y_train_final_merged, verbose=True)
    
    sig_clf = CalibratedClassifierCV(x_clf_with_best_hyper_param, method="sigmoid")
    
    sig_clf.fit(X_train_final_merged, y_train_final_merged)
    
    CPU times: user 2min 14s, sys: 1.75 s, total: 2min 16s
    Wall time: 2min 15s
    
    In [15]:
    %%time
    
    n_estimators = random_clf.best_params_['n_estimators']
    
    # LOGLOSS FOR TRAIN
    
    predict_y_train = sig_clf.predict_proba(X_train_final_merged)
    
    print ('With best number of estimators = ', n_estimators, "Our train log loss is:", log_loss(y_train_final_merged, predict_y_train))
    
    
    # LOGLOSS FOR TEST
    
    predict_y_test = sig_clf.predict_proba(X_test_final_merged)
    
    print('For values of best number of estimators = ', n_estimators, "The test log loss is:", log_loss(y_test_final_merged, predict_y_test))
    
    
    # LOGLOSS FOR CV
    
    predict_y_cv = sig_clf.predict_proba(X_cv_final_merged)
    
    print('With best number of estimators = ', n_estimators, "Our cross validation log loss is:", log_loss(y_cv_final_merged, predict_y_cv))
    
    With best number of estimators =  1000 Our train log loss is: 0.007050475486378417
    For values of best number of estimators =  1000 The test log loss is: 0.007045815985166401
    With best number of estimators =  1000 Our cross validation log loss is: 0.008472351060361807
    CPU times: user 2.72 s, sys: 65.8 ms, total: 2.78 s
    Wall time: 2.76 s
    

    Imgur

    53. Possibiliy of Further Analysis and Featurizition

    Back to the top

    We could experiment further with following features.

    • names of the imported functions
    • Libraries used
    • the number of procedures used.
    • Computing the ”constitionality” of the executable: the number of ”loc *” references in .asm file.
    • According to some papers, the malware content is often encrypted inside the binary, so we could introduce some encryption related features.
    • Computing entropy over a sliding window of half-byte sequences,
    • Extracting some statistics of its distribution (20 quantiles, 20 percentiles, mean, median, std, max, min, max-min),
    • Statistics of first order differences distribution and parts of the entropy sequence.
    • Computing compression ratio (as an approximation of Kolmogorov complexity).